**Patrick Horain · Catherine Achard Malik Mallem (Eds.)**

# **Intelligent Human Computer Interaction**

**9th International Conference, IHCI 2017 Evry, France, December 11–13, 2017 Proceedings**

### Lecture Notes in Computer Science 10688

Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

### Editorial Board

- David Hutchison, Lancaster University, Lancaster, UK
- Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
- Josef Kittler, University of Surrey, Guildford, UK
- Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
- Friedemann Mattern, ETH Zurich, Zurich, Switzerland
- John C. Mitchell, Stanford University, Stanford, CA, USA
- Moni Naor, Weizmann Institute of Science, Rehovot, Israel
- C. Pandu Rangan, Indian Institute of Technology, Madras, India
- Bernhard Steffen, TU Dortmund University, Dortmund, Germany
- Demetri Terzopoulos, University of California, Los Angeles, CA, USA
- Doug Tygar, University of California, Berkeley, CA, USA
- Gerhard Weikum, Max Planck Institute for Informatics, Saarbrücken, Germany

More information about this series at http://www.springer.com/series/7409


Editors:

- Patrick Horain, Telecom SudParis, Evry, France
- Catherine Achard, Pierre and Marie Curie University, Paris, France
- Malik Mallem, Univ. Evry, Paris Saclay University, Evry, France

ISSN 0302-9743 ISSN 1611-3349 (electronic) Lecture Notes in Computer Science ISBN 978-3-319-72037-1 ISBN 978-3-319-72038-8 (eBook) https://doi.org/10.1007/978-3-319-72038-8

Library of Congress Control Number: 2017960864

LNCS Sublibrary: SL3 – Information Systems and Applications, incl. Internet/Web, and HCI

© The Editor(s) (if applicable) and The Author(s) 2017. This book is an open access publication.

Open Access This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by Springer Nature. The registered company is Springer International Publishing AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

### Preface

The international conference on Intelligent Human Computer Interaction (IHCI) is a forum for the presentation of research results and technological advances at the crossroads of human-computer interaction, artificial intelligence, signal processing, and computer vision. It brings together engineers and scientists from around the world focusing on theoretical, practical, and applied aspects of the field.

The 9th event, IHCI 2017, took place during December 11–13, 2017, in Evry, France. The present proceedings consist of the papers presented at the conference.

The call for papers attracted 25 submissions from around the world, each of which was reviewed by at least two and up to four members of the International Program Committee. Fifteen oral communications were selected; their authors come from ten countries on four continents. The summary of one of the invited talks is also included. We thank all the invited speakers, authors, and members of the Program Committee for their contribution to making IHCI 2017 a stimulating and productive conference.

Finally, we gratefully acknowledge Telecom SudParis, Pierre and Marie Curie University, and Evry Val d'Essonne University for jointly sponsoring the conference. Special thanks go to the Telecom SudParis staff for their assistance and hard work in organizing the conference on campus and providing the logistics.

October 2017 Patrick Horain Catherine Achard Malik Mallem

### Organization

### Program Committee

- Catherine Achard, Pierre and Marie Curie University, France (Chair)
- Patrick Horain, Telecom SudParis, France (Co-chair)
- Malik Mallem, Univ. Evry, Paris Saclay University, France (Co-chair)
- Ajith Abraham, Machine Intelligence Research Labs, USA
- Rahul Banerjee, BITS Pilani, India
- Amrita Basu, Jadavpur University, India
- Samit Bhattacharya, IIT Guwahati, India
- Plaban Kumar Bhowmick, Indian Institute of Technology Kharagpur, India
- Jérôme Boudy, Telecom SudParis, France
- Thierry Chaminade, Institut de Neurosciences de la Timone, France
- Richard Chbeir, Université de Pau et des Pays de l'Adour, France
- Amine Chellali, Univ. Evry, Paris Saclay University, France
- Mohamed Chetouani, Pierre and Marie Curie University, France
- Keith Cheverst, Lancaster University, UK
- Gérard Chollet, CNRS, France
- Partha P. Das, IIT Kharagpur, India
- Alok Kanti Deb, IIT Delhi, India
- Laurence Devillers, LIMSI, Univ. Paris-Sud, Paris Saclay University, France
- Gaël Harry Dias, CNRS & University of Caen Basse-Normandie, France
- Bernadette Dorizzi, Telecom SudParis, France
- Shen Fang, Pierre and Marie Curie University, France
- Tom D. Gedeon, Australian National University, Australia
- Alexander Gelbukh, Mexican Academy of Science, Mexico
- Martin A. Giese, CIN/HIH University Clinic Tuebingen, Germany
- David Antonio Gomez Jauregui, ESTIA, France
- Michele Gouiffes, LIMSI, Univ. Paris-Sud, Paris Saclay University, France
- David Griol Barres, Carlos III University of Madrid, Spain
- Nesma Houmani, Telecom SudParis, France
- Ekram Khan, AMU, India
- Geehyuk Lee, KAIST, South Korea
- Atanendu Sekhar Mandal, CEERI, Pilani, India
- José Marques Soares, Universidade Federal do Ceará, Brazil
- Marion Morel, Pierre and Marie Curie University, France
- Galina Pasko, Uformia, Norway
- Dijana Petrovska, Telecom SudParis, France


### Additional Reviewers


### Sponsors

### Telecom SudParis

Telecom SudParis is a leading public graduate school of engineering in Information and Communication Technologies (ICT). It is part of the Institut Mines Télécom, France's leading group of engineering schools, supervised by the Ministry of Industry. It is part of the Université Paris-Saclay, the first French research cluster in sciences and technologies of information. The 105 full-time professors of Telecom SudParis contribute to the education of 1,000 students including 700 engineers and Master students and more than 150 doctoral students.

### Univ. Evry, Paris Saclay University

The University of Evry-Val d'Essonne was created in 1991 as part of the development of higher education in the Ile-de-France region. It is multidisciplinary and there are more than 160 curricula, over half of which are professionally-oriented. The University offers courses and research in Science, Technology, Law, Economics, Management, and the Social Sciences. It is part of the Université Paris-Saclay, the first French research cluster in sciences and technologies of information. The 500 full-time professors of the university contribute to the education of more than 10,000 students including 3,000 Master students and more than 300 doctoral students.

### Pierre and Marie Curie University

UPMC represents French excellence in science and medicine. A direct descendant of the historic Sorbonne, UPMC is the top French university by the Shanghai world rankings, 7th in Europe, and 36th in the world. UPMC encompasses all major sciences, such as mathematics (5th in the world); chemistry; physics; electronics; computer science; mechanics; Earth, marine, and environmental sciences; life sciences; and medicine.

# Invited Papers

### Optimizing User Interfaces for Human Performance

#### Antti Oulasvirta

School of Electrical Engineering, Aalto University, Espoo, Finland

Abstract. This paper summarizes an invited talk given at the 9th International Conference on Intelligent Human Computer Interaction (December 2017, Paris). Algorithms have revolutionized almost every field of manufacturing and engineering. Is the design of user interfaces next? This talk will give an overview of what the future holds for algorithmic methods in this space. I introduce the idea of using predictive models and simulations of end-user behavior in combinatorial optimization of user interfaces, as well as the contributions that inverse modeling and interactive design tools make. Several research results are presented, from gesture design to keyboards and web pages. Going beyond combinatorial optimization, I discuss self-optimizing or "autonomous" UI design agents.

### Simplexity and Vicariance: On Human Cognition Principles for Man-Machine Interaction

Alain Berthoz

Collège de France, French Academy of Science and Academy of Technology

Abstract. The study of living bodies reveals that, in order to solve complex problems in an efficient, fast, and elegant way, evolution has developed processes based on principles that are neither trivial nor simple. I have called them "simplex". They concern, for example, detours, modularity, anticipation, redundancy, inhibition, reduction of dimensionality, etc. They often use detours that seem to add an apparent complexity but which in reality simplify problem solving, decision, and action. Among these general principles, "vicariance" is fundamental. It is the ability to solve a given problem by different processes according to each person's capacity, the context, etc. It is also the ability to replace one process with another in the case of deficits, as well as the possibility to create new solutions. Indeed, it is the basis of creative flexibility.

I will give examples borrowed from perception, motor action, memory, spatial navigation, decision-making, relationships with others, and virtual worlds. I will show its importance for the compensation of neurological deficits and for the design of humanoid robots, for example. Finally, I will mention the importance of these principles in the fields of learning and education.

### Interpersonal Human-Human and Human-Robot Interactions

Mohamed Chetouani

Pierre and Marie Curie University, Paris, France

Abstract. Synchrony, engagement, and learning are key processes of interpersonal interaction. In this talk, we will introduce interpersonal human-human and human-machine interaction schemes and models, with a focus on definitions, sensing, and evaluation at both the behavioral and physiological levels. We will show how these models are currently applied to detecting engagement in multi-party human-robot interactions, detecting humans' personality traits, and task learning.

### Contents


### Applications



### Smart Interfaces

### **Optimizing User Interfaces for Human Performance**

Antti Oulasvirta(B)

School of Electrical Engineering, Aalto University, Espoo, Finland antti.oulasvirta@aalto.fi

**Abstract.** This paper summarizes an invited talk given at the 9th International Conference on Intelligent Human Computer Interaction (December 2017, Paris). Algorithms have revolutionized almost every field of manufacturing and engineering. Is the design of user interfaces next? This talk will give an overview of what the future holds for algorithmic methods in this space. I introduce the idea of using predictive models and simulations of end-user behavior in combinatorial optimization of user interfaces, as well as the contributions that inverse modeling and interactive design tools make. Several research results are presented, from gesture design to keyboards and web pages. Going beyond combinatorial optimization, I discuss self-optimizing or "autonomous" UI design agents.

### **Talk Summary**

The possibility of mathematical or algorithmic design of artefacts for human use has been a topic of interest for at least a century. Present-day user-centered design is largely driven by human creativity, sensemaking, empathy, and the creation of meaning. The goal of computational methods is to produce a full user interface (e.g., keyboard, menu, web page, gestural input method, etc.) that is good or even "best" for human use by some justifiable criteria. Design goals can include increases in speed or accuracy, or reductions in errors or ergonomic issues. Computational methods could speed up the design cycle and improve quality. Unlike any other design method, some computational methods offer a greater-than-zero chance of finding an optimal design. Computational design offers not only better designs, but a new, rigorous understanding of interface design. Algorithms have revolutionized almost every field of manufacturing and engineering. Why, then, has user interface design remained isolated?

The objective of this talk is to outline core technical problems and solution principles in computational UI design, with a particular focus on artefacts designed for human performance. I first outline the main approaches to algorithmic user interface (UI) generation: (1) use of psychological knowledge to derive or optimize designs [1–3], (2) breakdown of complex design problems into constituent decisions [4], (3) formulation of design problems as optimization problems [5], (4) use of design heuristics in objective functions [6], (5) use of psychological models in objective functions [7,8], (6) data-driven methods to generate designs probabilistically, (7) formulation of logical models of devices and tasks to drive the transfer and refinement of designs [9], and (8) learning of user preferences via interactive black-box machine learning methods [10]. I ask: Why is there no universal approach yet, given the tremendous success of algorithmic methods across the engineering sciences, and what would a universal approach entail? I argue that successful approaches require solving several hard, interlinked problems in optimization, machine learning, the cognitive and behavioral sciences, and design research.

© The Author(s) 2017. P. Horain et al. (Eds.): IHCI 2017, LNCS 10688, pp. 3–7, 2017. https://doi.org/10.1007/978-3-319-72038-8_1

I start with an observation of a shared principle across the seemingly different approaches: the shared algorithmic basis is *search*. "To optimize" is the act and process of obtaining the best solution under given circumstances; design is about the identification of optimal conditions for human abilities. To design an interactive system by optimization, a number of decisions are made such that they constitute as good a whole as possible. What differentiates these approaches is what the design task is, how it is obtained, and how it is solved. Four hard problems open up.

The first problem is the definition of design problems: the algorithmic representation of the atomic decisions that constitute the design problem. This requires not only abstraction and mathematical decomposition, but also an understanding of the designer's subjective and practical problem. I show several definitions for common problems in UI design and discuss their complexity classes. It turns out that many problems in UI design are exceedingly large, too large for trial-and-error approaches. To design an interactive layout (e.g., a menu), one must fix the types, colors, sizes, and positions of elements, as well as higher-level properties, such as which functionality to include. The number of combinations of such choices easily gets very large. Consider the problem of choosing functionality for a design: if for *n* functions there are 2<sup>*n*</sup> − 1 candidate designs, we already have 1,125,899,906,842,623 candidates with only 50 functions, and this is not even a large application.
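To make the scale of such design spaces concrete, here is a small sketch (our own illustration, not from the talk; the function names are hypothetical) counting candidates for the two problem classes mentioned above: choosing a subset of functionality, and assigning distinct elements to layout slots.

```python
from math import perm

def functionality_candidates(n: int) -> int:
    """Non-empty subsets of n candidate functions: 2^n - 1."""
    return 2 ** n - 1

def layout_candidates(n_items: int, n_slots: int) -> int:
    """Assignments of n_items distinct elements to n_slots positions
    (e.g., letters to keyboard keys): n_slots! / (n_slots - n_items)!."""
    return perm(n_slots, n_items)

print(functionality_candidates(50))   # 1125899906842623, as in the text
print(layout_candidates(26, 26))      # 26! full keyboard layouts, about 4.0e26
```

Both counts grow far too quickly for exhaustive enumeration, which is the point of the paragraph above: even a 50-function application already rules out trial-and-error search.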

The second problem is the definition of meaningful objective functions. The objective function assigns an *objective score* to a design candidate. It formalizes what is assumed to be 'good' or 'desirable' – or, inversely, undesirable when the task is to minimize. In applications in UI design, a key challenge is to formulate objective functions that encapsulate goodness in both the designer's and the end-users' terms. In essence, defining the objective function "equips" the search algorithm with design knowledge that tells what the designer wants and predicts how users interact and what they experience. This can cover surface features of the interface (e.g., visual balance), expected performance of users (e.g., 'task A should be completed as quickly as possible'), users' subjective preferences, and so on. However, it is tempting but naive to construct objective functions based on heuristics. These might be easy to express and compute, but they may have little value in producing good designs. It must be kept in mind that the quality of an interface is determined not by the designer, nor by some intrinsic quality of the interface, but by end-users, in their performance and experiences. I argue that an objective function should essentially be viewed as a predictor: a predictor of quality for end users. It must capture some essential tendencies in the biological, psychological, behavioral, and social aspects of human conduct. This fact drives a departure from traditional application areas of operations research and optimization, where objective functions have been based on the natural sciences and economics. I discuss the construction of objective functions based on theories and models from the cognitive sciences, motor control, and biomechanics.

A key issue we face in defining objective functions for interface design is the emergent nature of interaction: the way the properties of the design and of the user affect outcomes unfolds dynamically, over a period of time, in the actions and reactions of the user. A related issue is people's ability to adapt and to change strategies: the way they deploy their capacities in interaction complicates algorithmic design, because every design candidate generated by an optimizer must be evaluated against how users may adapt to it. I discuss approaches from bounded agents and computational rationality toward this end. Computational rationality (CR) [11] assumes an ideal agent performing under the constraints posed by the environment. This assumption yields good estimates in performance-oriented activities, but complicates computation considerably.

The third problem is posed by algorithmic methods. I discuss trade-offs among modern methods, which can be divided into two main classes: (i) heuristic methods such as genetic algorithms and (ii) exact methods such as integer programming. Exact methods offer mathematical guarantees for solutions; however, they require rigorous mathematical analysis and simplification of the objective function, which has succeeded in only a few instances in HCI thus far. Black-box methods, in contrast, can attack any design problem but typically demand empirical tuning of parameters and offer only approximate optimality. Here, the design of the objective function and the design task come to the fore. The choice of modeling formalism is central, as it determines how design knowledge is encoded and executed, and how interaction is represented.
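As a toy illustration of the heuristic class, the sketch below orders a small menu by simulated annealing. Everything in it is our own stand-in, not from the talk: the item set, the command frequencies, and the linear per-position time cost (a crude placeholder for the psychological models discussed above).

```python
import math, random

# Hypothetical command frequencies for a five-item menu.
FREQ = {"open": 0.4, "save": 0.3, "close": 0.15, "print": 0.1, "quit": 0.05}

def expected_time(order):
    """Expected selection time: position i costs 0.2 s + 0.1 s per step down."""
    return sum(FREQ[item] * (0.2 + 0.1 * i) for i, item in enumerate(order))

def anneal(items, iters=20000, t0=1.0, cooling=0.999, seed=0):
    """Minimize expected_time over orderings by random swaps with annealing."""
    rng = random.Random(seed)
    best = cur = list(items)
    t = t0
    for _ in range(iters):
        cand = cur[:]
        i, j = rng.sample(range(len(cand)), 2)   # propose: swap two items
        cand[i], cand[j] = cand[j], cand[i]
        delta = expected_time(cand) - expected_time(cur)
        # Accept improvements always; accept worse moves with a probability
        # that shrinks as the temperature cools.
        if delta < 0 or rng.random() < math.exp(-delta / t):
            cur = cand
            if expected_time(cur) < expected_time(best):
                best = cur[:]
        t *= cooling
    return best

print(anneal(list(FREQ)))   # the optimum places frequent commands first
```

With this objective, the optimum is simply the frequency-descending order; real objective functions (Fitts'-law models, visual search models) make the landscape far less trivial, which is where the black-box methods discussed above earn their keep.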

Fourth is the definition of task instances. In optimization parlance, a task instance is the task- and designer-specific parametrization of the design task: "What constitutes a good design in this particular case?" There are two main sources of information for determining a task instance. To capture a *designer's* intention, interactive optimization can be used. Characteristic of interaction design is that the objectives can be under-determined and the choices subjective and tacit [12]. The known approaches in design tools can be divided along four dimensions: (1) interaction techniques and data-driven approaches for specifying a design task for an optimizer, (2) control techniques offered for steering the search process, (3) techniques for selection, exploration, and refinement of outputs (designs), and (4) the level of proactivity taken by the tool, for example in guiding the designer toward good designs (as determined by an objective function). Principled approaches like robust optimization or Bayesian analysis can be used. I discuss lessons learned in this area.

However, the designer may not always be able to report all design-relevant objectives. For a full specification of a design task, one may need to algorithmically elicit what *users* "want" or "can do" from digitally monitorable traces. This is known as the inverse modeling problem [13]. I discuss probabilistic methods for cognitive models, which may disentangle the beliefs, needs, capabilities, and cognitive states of users as causes of their observed behavior. Alternatively, black-box models can be used. The benefit of white-box models, however, is that they allow the algorithm in some cases to predict the consequences (costs, benefits) of changing a design on users.

To conclude, perhaps the most daring proposition made here is that essential aspects of design, long considered a nuanced, tacit, and dynamic activity, can be abstracted, decomposed, and algorithmically solved, moreover in a way that is acceptable to designers. I review empirical evidence comparing computationally designed with manually designed UIs. However, much work remains to be done to identify scalable and transferable solution principles.

Even more critical is the discussion of what "design" is. Interaction design has been characterized as "the process that is arranged within existing resource constraints to create, shape, and decide all use-oriented qualities (structural, functional, ethical, and aesthetic) of a digital artefact for one or many clients" [14]. Some scholars go as far as claiming that interaction design is through-and-through subjective and experiential [15]. It is about conceptualizing product ideas and designing their behavior from a user's perspective. In this regard, computational methods still cover a limited aspect of design. Transcending beyond optimization, I end with a discussion of what *artificially intelligent UI design* might mean. I claim that "AI for Design" must meet at least five defining characteristics of design thinking: (1) agency, (2) problem-solving, (3) sense-making, (4) speculation, and (5) reflection. So far, no approach exists that – in a unified fashion and with good results – achieves this.

**Acknowledgements.** The work of AO has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 637991).

### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

### **Geometrical Shapes Rendering on a Dot-Matrix Display**

Yacine Bellik(✉) and Celine Clavel

LIMSI, CNRS, Univ. Paris-Sud, Université Paris-Saclay, Rue John von Neumann, Campus Universitaire d'Orsay, 91405 Orsay cedex, France {Yacine.Bellik,Celine.Clavel}@limsi.fr

**Abstract.** Using a dot-matrix display, it is possible to present geometrical shapes with different rendering methods: solid shapes, empty shapes, vibrating shapes, etc. An open question is then: *which rendering method allows the fastest and most reliable recognition performance using touch?* This paper presents the results of a user study that we conducted to address this question. Using a 60 × 60 dot-matrix display, we asked 40 participants to recognize 6 different geometrical shapes (square, circle, simple triangle, right triangle, diamond, and cross) within the shortest possible time. Six different methods to render the shapes were tested, depending on the rendering of the shape's outline and inside: a static outline combined with a static, vibrating, or empty inside, and a vibrating outline combined with a static, vibrating, or empty inside. The results show that squares, right triangles, and crosses are recognized more quickly than circles, diamonds, and simple triangles. Furthermore, the best rendering method is the one that combines a static outline with an empty inside.

**Keywords:** Touch · Dot-matrix display · Graphics · Geometry

### **1 Introduction**

Blind people can access digital documents using specific software called "screen readers". Screen readers can present, in a linear way, either through speech synthesis or braille, the content of a document or the elements of a graphical interface. However, access to graphics and other two-dimensional information is still severely limited for the blind. It is not easy for them to explore 2D structures (such as mathematical formulas, maps, or electronic circuit diagrams) using a screen reader. The user is then faced with many problems, such as disorientation and difficulty memorizing and building a correct mental model.

The work presented in this paper is a first step of a larger project that aims at defining new ways for the blind to access electronic documents while preserving the spatial layout of the document. The main idea of the project is to use a dot-matrix display to present the general spatial layout of the document. Each element of the document structure (title, paragraph, image, etc.) will be represented by a geometrical form that reflects the size and the position of the element in the document. When the user explores this spatial layout, he/she will be able to access the detailed content of the element currently under his/her fingers through another modality, such as speech synthesis or braille.

As a preliminary step, two questions should be addressed. First, which geometrical forms should be used? Obviously, using rectangles is the first idea that comes to mind, but is it possible to use other forms depending, for instance, on the information type? Second, which rendering method allows the fastest and most reliable recognition?

### **2 Related Work**

Different methods exist to translate graphical information into a tactile form to make it accessible to a blind person [2, 3]. 3D printing, collage, thermoforming, and embossed paper [8] are great for educational purposes, but they all share the same drawback: they produce static documents, which prevents useful interactive operations such as zooming and scrolling. This leads to a drastic reduction of information density, given the limited resolution of the skin. Furthermore, their quality decreases with use, and they require considerable space to be stored.

Other devices exist that provide a refreshable tactile display. They can be classified into two main categories. The first category concerns devices that allow tactile exploration of a large virtual surface using a small tactile device. A typical example is the VTPlayer mouse [9, 10], which can be used as a classical mouse to explore a virtual surface while receiving tactile stimuli through the index finger thanks to its 4 × 4 Braille dots. The main advantages of this device are its low cost and portability. However, exploration is generally done using only one finger, which leads to long exploration times before even very simple shapes are recognized.

Another similar device is the Tactograph [11, 12]. The Tactograph includes a STReSS<sup>2</sup> tactile display (see Fig. 1) [5], which allows the production of a variety of tactile stimuli, providing richer rendering of textures using thin strips that stretch the skin of the finger. However, it still allows only single-finger exploration.

**Fig. 1.** (a) Active area of the STReSS<sup>2</sup> tactile display, (b) STReSS<sup>2</sup> mounted on a planar carrier, and (c) usage of the device. Extracted from Levesque's website (http://vlevesque.com/papers/Levesque-HAPTICS08/)

The second category concerns devices that allow tactile exploration of a large physical surface using several fingers of both hands [6, 7]. The surface is generally composed of a matrix with a high number of Braille dots, which play the same role as pixels on screens. An example of such a device is the dot-matrix display designed by Shimada et al. [4], which offers 32 × 48 Braille dots. The main drawback of this kind of device is its cost.

In this paper, we present a study conducted using a device of this second category to identify the rendering features that allow the fastest and most reliable recognition of geometrical shapes. The protocol of this study was inspired by a study conducted by Levesque and Hayward [1] on a device of the first category (the STReSS<sup>2</sup> device).

### **3 User Study**

For this study, we used a 3600-dot-matrix display (60 × 60 dots) from metec AG. The display surface is 15 × 15 cm<sup>2</sup>. The dots can be in only two states: up or down. The device is presented in Fig. 2. It also has a set of buttons (some of which can be used as a braille keyboard) and a scrollbar.

**Fig. 2.** The dot-matrix display used in the study

### **3.1 Experimental Conditions**

**Shapes.** Six different shapes were used in the experiment. We chose the same shapes as those used in [1]. As shown in Fig. 3, these shapes are: square, circle, simple triangle, right triangle, diamond, and cross.

**Fig. 3.** The six shapes used in the study

**Size of shapes.** In [1], the shapes were selected to fill a 2 or 3 cm square, leading to two different sizes: *small* and *large*. In our experiment, we used three different sizes: *small*, *medium*, and *large*. Our small and medium sizes correspond respectively to the small and large sizes of Levesque's study (2 and 3 cm). Our large size corresponds to a 4 cm bounding square. We added this larger size because the dot-matrix display has a lower resolution than the STReSS<sup>2</sup> tactile display [5]. In the STReSS<sup>2</sup> device, the center-to-center distance between adjacent actuators is 1.2 × 1.4 mm, and the actuators can deflect toward the left or right by 0.1 mm. In our dot-matrix display, the horizontal and vertical distances between the dot centers are the same, equal to ~2.5 mm. The diameter of each dot is ~1 mm. So, we kept the same sizes as in [1] but added a supplementary (larger) one in case recognition performance would be affected by the poorer resolution of the dot-matrix display.
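As a rough sanity check of these dimensions (our own arithmetic, not reported in the paper), the number of dots spanning each bounding square follows directly from the ~2.5 mm dot pitch:

```python
def dots_per_side(side_cm: float, pitch_mm: float = 2.5) -> int:
    """Approximate number of dots spanning a square of the given side,
    assuming a ~2.5 mm center-to-center dot spacing."""
    return round(side_cm * 10 / pitch_mm)

for name, side_cm in [("small", 2), ("medium", 3), ("large", 4)]:
    print(f"{name}: {side_cm} cm -> {dots_per_side(side_cm)} dots per side")
# small: 8 dots, medium: 12 dots, large: 16 dots per side
```

Even the large 4 cm shape thus spans only about 16 of the display's 60 dots per axis, which makes the resolution concern raised above concrete.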

**Rendering of shapes.** Six different rendering methods were used during the experiment, depending on the way the outline<sup>1</sup> and the inside of the shapes are displayed. Each of these two elements can be rendered in 3 different ways: static, vibrating, or empty. The vibration effect is obtained by putting the dots up and down alternately at a frequency of 10 Hz, so as not to damage the device. Theoretically, this should lead to 9 different **R**endering **M**ethods (RM), as shown in Table 1.


**Table 1.** Features of different renderings of shape.

However, a closer look at these 9 rendering methods shows that 3 of them (RM7, RM8, RM9) are not relevant. RM9 displays nothing, since both the outline and the inside of the shape are empty. RM7 is equivalent to RM1, because removing the outline merely makes the shape slightly smaller; RM8 is equivalent to RM5 for the same reason. Since the size factor is evaluated separately, we decided not to consider RM7 and RM8. Figure 4 illustrates the 6 RMs that were kept. Note that in [1], only RM1, RM3, and RM6 were used.
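The pruning argument above can be sketched in a few lines of Python (an illustration of the reasoning, not the authors' software; all names are ours):

```python
from itertools import product

# Each rendering method is an (outline, inside) pair; each element can be
# static, vibrating, or empty, giving 9 candidate combinations.
STATES = ("static", "vibrating", "empty")

def relevant_rendering_methods():
    kept = []
    for outline, inside in product(STATES, STATES):
        if outline == "empty" and inside == "empty":
            continue  # RM9: nothing is displayed at all
        if outline == "empty":
            continue  # RM7/RM8: same as the filled rendering, only slightly smaller
        kept.append((outline, inside))
    return kept

# 9 candidates minus the 3 non-relevant ones leaves the 6 methods of Fig. 4
assert len(relevant_rendering_methods()) == 6
```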

**Fig. 4.** The six rendering methods used in the study.

<sup>1</sup> The outline is one dot wide, i.e., ~1 mm.

### **3.2 Participants**

Data were collected from 40 sighted subjects (31 men and 9 women), aged from 18 to 40 (*M* age = 23.7; *SD* age = 5.2). Many participants had a computer science background. All participants filled out a background questionnaire used to gather personal information such as age and education level. Our sample comprised 34 right-handers and 6 left-handers. All participants were naive with respect to the experimental setup and the purpose of the experiment.

### **3.3 Protocol**

First, each participant was invited to sign an informed consent form, and an overview of the experiment was provided. The experiment was conducted in two main phases:


During the test, shapes varied according to the geometrical form, the size, and the rendering method. The order of the forms, sizes, and rendering methods was randomized across participants. In all, each participant had to recognize 324 shapes (6 forms × 3 sizes × 6 rendering methods × 3 sessions).
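The construction of a participant's randomized trial list can be sketched as follows (a hypothetical helper, not the authors' software; RM labels follow Sect. 3.1):

```python
import itertools
import random

SHAPES = ["square", "circle", "simple triangle",
          "right triangle", "diamond", "cross"]
SIZES = ["small", "medium", "large"]
RENDERINGS = [f"RM{i}" for i in range(1, 7)]
SESSIONS = 3

def participant_trials(seed):
    # Every (shape, size, rendering) combination appears once per session
    trials = [
        combo
        for _ in range(SESSIONS)
        for combo in itertools.product(SHAPES, SIZES, RENDERINGS)
    ]
    random.Random(seed).shuffle(trials)  # random order per participant
    return trials

# 6 forms x 3 sizes x 6 rendering methods x 3 sessions = 324 presentations
assert len(participant_trials(0)) == 324
```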

### **3.4 Measures**

For each shape, we recorded the recognition time (in milliseconds) and the participant's answer. The participants used a button to display/hide the figure (which starts/stops the chronometer), and answers were given verbally. We developed a program to extract the dependent variables from the log files generated during the test.

### **4 Results**

The results presented in this section are considered statistically significant when p < 0.05. Results are explicitly referred to as a "trend" when p is between 0.05 and 0.1. We applied the Shapiro-Wilk test to verify that the variables satisfied normality assumptions; this held only for the recognition time variable. Recognition time was analyzed by means of ANOVAs<sup>2</sup> with shape, shape size, and the combination of rendering methods for the shape's outline and inside as factors. ANOVAs were calculated

<sup>2</sup> Regarding each factor, a one-way ANOVA was conducted for the recognition time.

using Statistica 9. Post hoc comparisons used Student's t-test. A chi-square test was performed for the recognition rate.

#### **4.1 Recognition Rate**

We first analyzed the results by considering all answers given by the subjects and conducted a chi-square analysis. Results show that the recognition rate does not vary according to the geometrical form, the size, or the rendering method. The global mean recognition rate is 95%. Table 2 provides the detailed recognition percentages for each category. The chi-square analysis reveals that shapes are well recognized whatever the geometrical form, the size, and the rendering.


**Table 2.** Recognition rate according to the shape, the size, and the rendering.

#### **4.2 Recognition Time**

**Shape effect.** We observed a main effect of the geometrical shape on the recognition time (*F*(5, 195) = 39.295, *p* < 0.001; see Fig. 5). Post hoc comparisons suggested that participants tended to recognize crosses, squares, and right triangles more quickly than circles, diamonds, and simple triangles. There was no significant difference between crosses, squares, and right triangles, nor between circles, diamonds, and simple triangles.

**Fig. 5.** Recognition time according to the geometrical shape.

**Size effect.** We observed a main effect of the size on the recognition time (*F*(2, 78) = 86.157; *p* < 0.001). Post hoc comparisons suggested that participants recognized small shapes more slowly (*Mean* = 6219.87; *SD* = 1996.642) than medium (*Mean* = 5300.23; *SD* = 63838.58) or large shapes (*Mean* = 5242.80; *SD* = 1674.63). There was no significant difference between the medium and large shapes.

**Rendering method effect.** We observed a main effect of the rendering method on the recognition time (*F*(5, 195) = 73.237, *p* < 0.001; see Fig. 6). Post hoc comparisons suggested that the best configuration is when the rendering method combines a static

**Fig. 6.** Recognition time according to the combination of inside and outline rendering.

outline with an empty inside: participants recognized shapes faster with this configuration than with the others. In addition, post hoc comparisons suggested that the worst configuration is when the rendering method combines a vibrating outline with a vibrating inside: participants recognized shapes more slowly with this configuration than with the others. Finally, post hoc comparisons suggested that the recognition time varies according to the combination of rendering methods.

### **5 Discussion**

The previous section revealed three important results.


**Fig. 7.** Recognition rate comparison between the STReSS<sup>2</sup> and the dot-matrix display according to shapes.

However, even though the following results should be interpreted with care due to different experimental conditions, a comparison with Levesque and Hayward's results shows that recognition of geometrical shapes is better with a dot-matrix display than with a STReSS<sup>2</sup> device in all cases. Figures 7, 8 and 9 compare our average recognition rates (Dot-Matrix) with theirs (STReSS<sup>2</sup>) depending, respectively, on shapes, sizes, and rendering methods.

**Fig. 8.** Recognition rate comparison between the STReSS<sup>2</sup> and the dot-matrix display according to size.

**Fig. 9.** Recognition rate comparison between the STReSS<sup>2</sup> and the dot-matrix display according to rendering method.

Concerning the recognition time, Levesque and Hayward found that recognition took 14.2 s on average, while in our study it took 5.6 s on average (2.5× faster).

### **6 Conclusion**

This article explored several haptic rendering methods for presenting geometrical shapes through touch, using several fingers on a large physical surface: the dot-matrix display. The presented study allowed us to collect 12,960 recognition times and 12,960 recognition scores (324 shapes × 40 participants). Results show that the best rendering method is the one combining a static outline with an empty inside, and that squares, right triangles, and crosses are recognized more quickly than circles, diamonds, and simple triangles. These results are of interest for our project on spatial access to documents by the blind.

The protocol of the presented study was inspired by a similar study conducted by Levesque and Hayward on a smaller device that allows exploring a virtual surface with only one finger: the STReSS<sup>2</sup> device. The comparison of results shows that recognition rates and times on a dot-matrix display are better in all cases. However, further investigation is needed to determine whether this is due to mono-finger vs. multi-finger exploration or to other reasons.

The next step of this work will be to reproduce the same experiment with visually impaired people. It would also be interesting to study the effects of different vibration frequencies and outline widths, as well as to compare the performance of the dot-matrix display with that of a vibrotactile device such as in [13].

### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Dynamic Hand Gesture Recognition for Mobile Systems Using Deep LSTM**

Ayanava Sarkar<sup>1</sup>, Alexander Gepperth<sup>2(B)</sup>, Uwe Handmann<sup>3</sup>, and Thomas Kopinski<sup>4</sup>

<sup>1</sup> Computer Science Department, Birla Institute of Technology and Science, Pilani, Dubai Campus, Dubai, UAE

<sup>2</sup> Computer Science Department, University of Applied Sciences Fulda, Fulda, Germany
alexander.gepperth@cs.hs-fulda.de

<sup>3</sup> Computer Science Department, University of Applied Sciences Ruhr West, Mülheim, Germany

<sup>4</sup> Computer Science Department, University of Applied Sciences South Westphalia, Iserlohn, Germany

**Abstract.** We present a pipeline for recognizing dynamic freehand gestures on mobile devices based on depth information from a single Time-of-Flight sensor. Hand gestures are recorded with a mobile 3D sensor, transformed frame by frame into an appropriate 3D descriptor, and fed into a deep LSTM network for recognition. As LSTM is a recurrent neural model, it is uniquely suited for classifying explicitly time-dependent data such as hand gestures. For training and testing purposes, we create a small database of four hand gesture classes, each comprising 150 recordings of 40 3D frames. We conduct experiments concerning execution speed on a mobile device, generalization capability as a function of network topology, and classification ability 'ahead of time', i.e., before the gesture is completed. Recognition rates are high (>95%) and maintainable in real time, as a single classification step requires less than 1 ms of computation time, introducing freehand gestures for mobile systems.

**Keywords:** Mobile computing *·* Gestural interaction *·* Deep learning

# **1 Introduction**

Gestures are a well-known means of interaction on mobile devices such as smartphones or tablets, so well integrated into the interface between man and machine that their absence would be unthinkable. However, this holds only for touch gestures, as three-dimensional or freehand gestures have yet to find their way into our everyday lives as a means of interaction. While freehand gestures are steadily being included as an additional means of control in various fields (the entertainment industry, infotainment systems in cars), within the domain of mobile devices a number of limitations present obstacles to be overcome in order to make this a seamless interaction technique.

First and foremost, data has to be collected in an unobtrusive manner, so no sensors attached to the user's body can be used. As mobile devices have to remain operable independent of the user's location, the number of employable technologies is drastically reduced. Eligible sensor technology is mainly limited to Time-of-Flight (TOF) technology, as it is not only capable of providing depth information independent of the background illumination but can do so at high frame rates. This is a prerequisite for an interface incorporating freehand gesture control, as it keeps the system's reaction times at a minimum. TOF technology has yet to be established as a standard component in mobile devices (one exception being the Lenovo PHAB2 Pro), and it moreover suffers from comparatively low resolution, potentially high noise, and heat development. Despite these drawbacks it is a viable choice, since the benefits outweigh the disadvantages, as will be shown in this contribution. Realizing freehand gestures as an additional means of control not only overcomes problems such as the use of gloves or the occlusion of the screen during touch interaction; it also allows for increased expressiveness (with additional degrees of freedom), which in turn opens a whole domain of novel applications, especially on mobile devices. This is corroborated by the fact that car manufacturers, which have always driven innovation by integrating new technologies into the vehicle, have recently begun incorporating freehand gestures into the vehicle interior (e.g., BMW, VW). The automotive environment faces similar problems, such as strong illumination variance, but can compensate for difficulties such as high power consumption.

In this contribution we present a lightweight approach to dynamic hand gesture recognition on mobile devices. We collect data from a small TOF sensor attached to a tablet. Machine learning models are trained on a dynamic hand gesture database, and these models are in turn used to realize a dynamic hand gesture recognition interface capable of detecting gestures in real time.

The approach presented in this contribution can be set apart from other work in the field of Human Activity Recognition (HAR) by the following aspects. We use a single TOF camera to retrieve raw depth information from the surrounding environment; this allows for high-frame-rate recordings of nearby interaction while making the retrieved data robust to illumination changes. Moreover, our approach is viable using only this single sensor, in contrast to other methodologies where data from various kinds of sources is fused. Furthermore, acquiring data in a non-intrusive manner allows for full expressiveness, in contrast to data coming from sensors attached to the user's body. The process as a whole is feasible in real time insofar as, once the model is trained, it can simply be transferred onto a mobile device and used with no negative impact on the device's performance. The remaining sections are organized as follows: the work presented in this contribution is contrasted with state-of-the-art methodology in dynamic freehand gesture recognition (Sect. 1.1). The machine learning models are trained on a database described in Sect. 2.1. Data samples are transformed and presented to the LSTM models in the manner outlined in Sect. 2.2. The LSTM models, along with the relevant parameters, are explained in Sect. 2.3. The experiments are laid out in Sect. 3, along with the description of the parameter search (Sect. 3.1) and model accuracy (Sect. 3.3). The resulting hand gesture demonstrator is described in Sect. 5, along with a discussion of its applicability. Section 6 sums up this contribution and provides a critical reflection on open questions, along with an outlook on future work.

#### **1.1 Dynamic Hand Gesture Detection - An Overview**

Recurrent Neural Networks (RNNs) have been employed for gesture detection by fusing inputs from raw depth data, skeleton information, and audio [4]. Recall (0.87) and precision (0.89) peak, as expected, when information from all three channels is fused. The authors of [5] present DeepConvLSTM, a deep architecture for Human Activity Recognition (HAR) combining convolutional layers with recurrent LSTM layers. Data is provided by several sensors attached to the human body, extracting accelerometric, gyroscopic, and magnetic information. Again, recognition accuracy improves strongly as more data is fused. Their approach demonstrates how HAR can be improved with LSTMs, as CNNs seem unable to model temporal information on their own. The authors of [6] use BLSTM-RNNs to recognize dynamic hand gestures and compare this approach to standard techniques. However, body-attached sensors are again employed to extract movement information, and the results are comparatively low considering that little noise is present during information extraction. No information is given regarding execution time, raising the question of real-time applicability.

### **2 Methods**

#### **2.1 The Hand Gesture Database**

Data is collected from a TOF sensor at a resolution of 320 *×* 160 pixels. Depth thresholding removes most of the irrelevant background information, leaving only hand and arm voxels. Principal Component Analysis (PCA) is used to crop most of the negligible arm parts. The remaining part of the point cloud carries the relevant information, i.e., the shape of the hand. Figure 1 shows a color-coded snapshot of a hand posture.

**Fig. 1.** Data and data generation. Left: Sample snapshot of a resulting point cloud after cropping from the front (left) and side view (right) during a grabbing motion. The lower snapshot describes the hand's movement for each viewpoint (left and right respectively). Right: The Setup - tablet with a picoflexx (indicated with yellow circle). (Color figure online)

We recorded four different hand gestures from a single person at one location for our database: close hand, open hand, pinch-in, and pinch-out. The latter two gestures are performed by closing/opening two fingers. For a single dynamic gesture recording, 40 consecutive snapshots (no segmentation or sub-sampling) are taken from the sensor and cropped by the aforementioned procedure. In this manner, 150 gesture samples of 40 frames each are present per class in the database, summing to a total of 24,000 data samples.

#### **2.2 From Point Clouds to Network Input**

A point cloud is usually described by so-called descriptors, which in our case need to describe the shape of hand, palm, and fingers precisely at a certain point in time. The options for describing point cloud data essentially come down to either using some form of convexity measure or calculating the normals for all points in a cloud; either way, the computation has to remain feasible in order to maintain real-time capability. In this contribution, the latter methodology is implemented: for a single point cloud, the normals of all points are calculated. Then, for two randomly selected points in the cloud, the PFH metric is computed [7,8]. This procedure is repeated for up to 5000 randomly selected point pairs extracted from the cloud. Each computation yields a descriptive value, which is binned into a 625-dimensional histogram. One such histogram therefore describes a single point cloud snapshot at a single point in time. These histograms form the input for training and testing the LSTM models.
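A hedged sketch of this descriptor stage follows. The four PFH-style pair features and the 5-bins-per-feature quantization (5<sup>4</sup> = 625) are our assumptions based on the usual PFH formulation [7,8], not the authors' exact implementation; the distance cap of 0.3 m is likewise illustrative.

```python
import numpy as np

def pfh_histogram(points, normals, n_pairs=5000, bins=5, rng=None):
    """Accumulate PFH-style pair features of random point pairs into a
    bins**4-dimensional (here 625-d) normalized histogram."""
    if rng is None:
        rng = np.random.default_rng(0)
    hist = np.zeros(bins ** 4)
    n = len(points)
    for _ in range(n_pairs):
        i, j = rng.integers(0, n, size=2)
        if i == j:
            continue
        d = points[j] - points[i]
        dist = np.linalg.norm(d)
        if dist == 0:
            continue
        u = normals[i]                 # Darboux frame at the source point
        v = np.cross(u, d / dist)
        w = np.cross(u, v)
        # Four pair features, each mapped into [0, 1) before quantization
        f = np.array([
            (np.dot(v, normals[j]) + 1) / 2,                       # alpha
            (np.dot(u, d / dist) + 1) / 2,                         # phi
            (np.arctan2(np.dot(w, normals[j]),
                        np.dot(u, normals[j])) / np.pi + 1) / 2,   # theta
            min(dist / 0.3, 0.999),    # distance, assumed capped at 0.3 m
        ])
        # Combine the four quantized features into one histogram index
        idx = 0
        for x in np.clip((f * bins).astype(int), 0, bins - 1):
            idx = idx * bins + x
        hist[idx] += 1
    return hist / max(hist.sum(), 1)   # normalized 625-d descriptor
```

One such call per frame yields the per-frame network input described below.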

#### **2.3 LSTM Model for Gesture Recognition**

To process the video frames sequentially, we use a deep RNN with LSTM neurons, where the LSTM term for a neuron is "memory cell" and the term for a hidden layer is "memory block". At the core of each memory cell is a linear unit supported by a single self-recurrent connection whose weight is initialized to 1.0; in the absence of any other input, this self-connection preserves the cell's current state from one moment to the next. In addition to the self-recurrent connection, cells also receive input from input units and from other cells and gates. The key component of an LSTM cell inside the memory block is its cell state, referred to as *c<sub>t</sub>*, the cell state at time step *t*. The cell state is unique to the cell, and any change to it is made through three gates: the input gate, the output gate, and the forget gate. The output of each gate is a value between 0 and 1, with 0 signifying "let nothing through the gate" and 1 signifying "let everything through the gate". The input gate determines how much of the input is forwarded to the cell; the forget gate determines how much of the cell's previous state to keep, i.e., the extent to which a value remains in the cell state; and the output gate computes the output activation, determining how much of the cell's activation is output.

At time step *t*, the input to the network is *x<sub>t</sub>* and *h<sub>t−1</sub>*, where the former is the current input and the latter is the output at time step *t* − 1. For the first time step, *h<sub>t−1</sub>* is taken to be 1.0. In the hidden layers (memory blocks), the output of one memory block forms the input to the next block. The following equations describe the inner workings of an LSTM model, where *W* refers to the weights, *b* to the biases, and σ to the sigmoid function, which outputs a value between 0 and 1:

$$i\_t = \sigma(W\_{ix}x\_t + W\_{ih}h\_{t-1} + b\_i) \tag{1}$$

Equation 1 refers to the calculation of the input gate. Final output of the input gate is a value between 0 and 1.

$$f\_t = \sigma(W\_{fx}x\_t + W\_{fh}h\_{t-1} + b\_f) \tag{2}$$

Equation 2 refers to the calculation of the forget gate. Final output of the forget gate is a value between 0 and 1.

$$o\_t = \sigma(W\_{ox}x\_t + W\_{oh}h\_{t-1} + b\_o) \tag{3}$$

Equation 3 refers to the calculation of the output gate. Final output of the output gate is a value between 0 and 1.

$$g\_t = \tanh(W\_{gx}x\_t + W\_{gh}h\_{t-1} + b\_g) \tag{4}$$

Equation 4 refers to the calculation of *g<sub>t</sub>*, the new candidate values to be added to the previous cell state; the tanh function outputs a value between −1 and 1, specifying how much of the input is relevant to the cell state.

$$c\_t = f\_t c\_{t-1} + i\_t g\_t \tag{5}$$

Equation 5 refers to the calculation of the new cell state, replacing the old one.

$$h\_t = \tanh(c\_t) o\_t \tag{6}$$

Equation 6 refers to the calculation of the hidden state, i.e., the output of that particular memory block, which then serves as input to the next memory block; the tanh function bounds it between −1 and 1. Further information about these equations can be found in [1].

The final output of the LSTM network is produced by a linear regression readout layer that transforms the states *c<sub>t</sub>* of the last hidden layer into class membership estimates, using the standard softmax non-linearity to obtain positive, normalized class membership estimates.
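Equations 1-6 and the softmax readout can be condensed into a short NumPy sketch (our illustration, not the authors' TensorFlow implementation; the weight shapes and dictionary layout are assumptions):

```python
import numpy as np

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One memory-block step implementing Eqs. (1)-(6).
    W[k] has shape (hidden, input + hidden); b[k] has shape (hidden,)."""
    z = np.concatenate([x_t, h_prev])
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    i_t = sigmoid(W["i"] @ z + b["i"])   # input gate, Eq. (1)
    f_t = sigmoid(W["f"] @ z + b["f"])   # forget gate, Eq. (2)
    o_t = sigmoid(W["o"] @ z + b["o"])   # output gate, Eq. (3)
    g_t = np.tanh(W["g"] @ z + b["g"])   # candidate values, Eq. (4)
    c_t = f_t * c_prev + i_t * g_t       # new cell state, Eq. (5)
    h_t = np.tanh(c_t) * o_t             # block output, Eq. (6)
    return h_t, c_t

def softmax_readout(h_t, W_out, b_out):
    # Linear readout + softmax -> positive, normalized class estimates
    logits = W_out @ h_t + b_out
    e = np.exp(logits - logits.max())
    return e / e.sum()
```

Feeding the 625-dimensional histograms frame by frame through such steps and reading out after the last frame mirrors the per-frame processing described above.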

### **3 Experiments and Observations**

The implementation was done in TensorFlow using Python. There are a total of 150 recordings for each of the 4 hand gesture classes. The model is trained on *N<sub>tr</sub>* = 480 samples, 120 from each of the 4 classes, and evaluated on *N<sub>te</sub>* = 120 samples, 30 from each class. All three parts of the experiment adhere to this partitioning of the data. In our implementation, each gesture is represented by a tensor of 40 *×* 625 numbers, while the input of the deep LSTM network corresponds to the dimension of a single frame, i.e., 625 numbers.

#### **3.1 Model Parameters**

Network training was conducted using the standard tools provided by the TensorFlow package, namely the Adam optimization algorithm [2,3]. Since the performance of our deep LSTM network depends strongly on network topology and the precise manner of training, we performed a search procedure, varying the principal parameters involved. These parameters are given in Table 1, along with the ranges over which they were varied.


**Table 1.** Principal parameters for network topology and training. The last column indicates the range of values that were exhaustively tested for these parameters.

#### **3.2 Deep LSTM Parameter Search**

Initially, with B = 2, 5, 10, M is varied from 1 to 4 for each value of B, C is varied over 128, 256, and 512 for each value of B and M, and I is varied over 100, 500, and 1000 for each value of the other three parameters. The learning rate is kept constant at 0.0001. Thus, over all combinations of B, M, C, and I, a total of 108 experiments were carried out.
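The grid can be written out explicitly to verify the count (symbols as defined in Table 1; this enumeration is our sketch, not the authors' code):

```python
from itertools import product

# Parameter ranges from the search described above (meanings per Table 1)
GRID = {
    "B": [2, 5, 10],
    "M": [1, 2, 3, 4],
    "C": [128, 256, 512],
    "I": [100, 500, 1000],
}

# One experiment per combination: 3 x 4 x 3 x 3 = 108 runs
experiments = [dict(zip(GRID, combo)) for combo in product(*GRID.values())]
assert len(experiments) == 108
```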

Let the prediction for each sample entered into the model be denoted *P<sub>i</sub>*, where *i* is the index of the sample in the test data. *P<sub>i</sub>* is computed for all frames of test sample *i*, and the prediction obtained at the last frame defines *P<sub>i</sub>*. It is also possible to consider *P<sub>i</sub>* for frames < 40, achieving ahead-of-time guesses at the price of potentially reduced accuracy. *P<sub>i</sub>* is a vector of length 4, since there are 4 classes in the experiment. We take the argmax of these 4 elements as the predicted class, as shown in Eq. 7. To test whether the prediction is correct, it is compared with the label of the data sample, *l<sub>i</sub>*.

$$
\tilde{p}\_i = \text{argmax}(\mathbf{P}\_i) \tag{7}
$$

$$\xi = 100 \frac{\#(\tilde{p}\_i = l\_i)}{N\_{te}} \tag{8}$$


Equation 8 gives the formula used for calculating the accuracy (in percent).
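Equations 7 and 8 amount to the following few lines (our sketch; the toy values are illustrative only):

```python
import numpy as np

def accuracy(P, labels):
    """P: array of shape (N_te, 4) with final-frame class estimates;
    labels: array of shape (N_te,) with ground-truth class indices."""
    predicted = np.argmax(P, axis=1)             # Eq. (7)
    return 100.0 * np.mean(predicted == labels)  # Eq. (8)

# Toy example: four test samples, the last one misclassified
P = np.array([[0.7, 0.1, 0.1, 0.1],
              [0.2, 0.5, 0.2, 0.1],
              [0.1, 0.2, 0.6, 0.1],
              [0.4, 0.3, 0.2, 0.1]])
labels = np.array([0, 1, 2, 3])
assert accuracy(P, labels) == 75.0
```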

#### **3.3 Measuring Accuracy as a Function of Observation Time**

In the second part of our experimentation, we train the model as in Sect. 3.2, but in the testing phase we calculate predictions at different in-gesture time steps (frames) *t*. Let *P<sub>i,t</sub>* denote the prediction for sample *i* at *t* < 40. To understand how the prediction varies as more frames are processed, we calculate the predictions *P<sub>i,t</sub>* at time steps *t* = {10, 20, 25, 30, 39, 40}. Here, we perform a class-wise analysis to determine which classes lend themselves best to ahead-of-time "guessing", which can be very important in practice.

### **3.4 Speedup and Optimization of the Model**

The implementation shown so far is focused on accuracy alone. Since mobile devices in particular lack fast and capable processing units, the aim of this part of the article is to speed up gesture recognition as much as possible by simplifying the LSTM model, if possible without compromising its accuracy. To this end, B is kept constant at 2, while M is taken to be 1 in all experiments. The number of memory cells in the single memory block is either 8 or 10. With such a small network, we can greatly speed up the system and minimize the computational complexity of the entire model.

# **4 Experimental Results**

### **4.1 Deep LSTM Parameter Search**

Of the 108 experiments conducted by varying B, M, C, and I, 20 accuracies are reported in Table 2, chosen to cover the diversity of the experimental setup.

From the observations, it can be concluded that for given M, C, and I, the accuracy improves as B increases; thus, B = 10 yields greater accuracy on the test data than B = 2 or B = 5. This can be explained by the fact that for a given I, the model undergoes a total of (I × B) training steps in this experimental setup; as B increases, so does (I × B), and consequently the accuracy of prediction. For given B, C, and I, varying M between 1 and 4 shows that accuracy improves significantly with the number of hidden layers: as the number of layers increases, the network becomes more complex and can capture more complex features from the data, yielding more accurate predictions. Similarly, keeping B, M, and I constant and varying C over 128, 256, and 512, we observe that accuracy increases with the number of memory cells in each memory block. Similar results were observed when I is varied, keeping

**Table 2.** Results for exhaustive parameter search in topology space. In total, we conducted 108 experiments by varying the network topology and training parameters. The best 18 results are shown here. The column headings correspond to the symbols defined in Table 1.


**Fig. 2.** Left: accuracy of prediction of a single test data sample, with B = 2, M = 1, C = 512 and I = 1000, at different in-gesture time steps t. Right: accuracy of prediction (taken at the end of a gesture) depending on training iterations for a small LSTM network size.

B, M, and C constant; this can be explained by the fact that the model has more iterations to adjust its weights toward the correct prediction. Table 2 shows the accuracies for the different combinations of the network parameters.

#### **4.2 Measuring Accuracy as a Function of Observation Time**

In this part we calculate different quality measures as a function of the frame *t* at which they are obtained. The graph in Fig. 2 shows that as the number of time steps increases, the accuracy increases until the maximum is reached in the last 5 time steps. Furthermore, we can also evaluate the confidence of each classification: as classification of test sample *i* is performed by taking the argmax of the network output *P<sub>i</sub>*, the confidence of this classification is related to max *P<sub>i</sub>*. We might expect the confidence of classification to increase with *t*, as more frames have been processed for higher *t*. Figure 3a and b depicts the average maxima plus standard deviations (measured on test data) as a function of class. We observe that, consistent with the increase in accuracy over in-gesture time *t*, the certainty of predictions increases as well, although this depends strongly on the individual classes, reflecting that some classes are less ambiguous than others.
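Reading off ahead-of-time guesses and their confidences can be sketched as follows (our illustration; `ahead_of_time` is a hypothetical helper name):

```python
import numpy as np

def ahead_of_time(per_frame_P, steps=(10, 20, 25, 30, 39, 40)):
    """per_frame_P: array of shape (40, 4) with the network output after
    each frame of one test sample. Returns, for each probed time step t,
    the ahead-of-time class guess (argmax) and its confidence (max)."""
    results = {}
    for t in steps:
        P_t = per_frame_P[t - 1]            # output after t frames
        results[t] = (int(np.argmax(P_t)),  # class guess at time t
                      float(np.max(P_t)))   # confidence of that guess
    return results
```

Averaging these per-step confidences over the test samples of each class yields curves of the kind shown in Fig. 3a and b.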

#### **4.3 Speedup and Optimization of the Model**

We observe that when the size of the network is greatly reduced, to a single memory block with either 8 or 10 memory cells, the accuracy is not as high as in Sect. 4.1. Hence, to reach the same level of accuracy as in Sect. 4.1, the number of training iterations was increased. The performance is shown in Fig. 2, showing

(a) Average and standard deviations of prediction maxima plotted against in-gesture time for classes 1 and 2

(b) Average and standard deviations of prediction maxima plotted against in-gesture time for classes 3 and 4

**Fig. 3.** "Ahead of time" classification accuracy for classes 1 and 2 (left) as well as 3 and 4 (right).

that 100% accuracy can be achieved even with small networks, although training time (and thus the risk of overfitting) increases strongly.

# **5 System Demonstrator**

#### **5.1 Hardware**

The system setup consists of a Galaxy NotePro 12.2 tablet running Android 5.02. A picoflexx TOF sensor from PMD Technologies is attached to the tablet via USB. It has an IRS1145C Infineon 3D Image Sensor IC chip based on pmd intelligence, capable of capturing depth images at up to 45 fps. VCSEL illumination at 850 nm allows depth measurements within a range of up to 4 m; however, the measurement error increases with the distance of objects to the camera, so the sensor is best suited for near-range interaction applications of up to 1 m. The lateral resolution of the camera is 224 *×* 171, resulting in 38304 voxels per recorded point cloud. The depth resolution of the picoflexx depends on the distance and, according to the manufacturer's specifications, is 1% of the distance within a range of 0.5–4 m at 5 fps and 2% of the distance within a range of 0.1–1 m at 45 fps. Depth measurements using TOF technology require several sampling steps in order to reduce noise and increase precision. As the camera offers several preset modes with different numbers of sampling steps, we opted for 8 sampling steps per frame, as this resulted in the best performance of the camera with the lowest noise. This was determined empirically in line with the positioning of the device. Several angles and locations for positioning the camera are conceivable due to its small dimensions of 68 mm *×* 17 mm *×* 7.25 mm. As we want to set up a demonstrator to validate our concept, the exact position of the camera

**Fig. 4.** Graph plotting the time required to crop the hand and reduce the number of relevant voxels with respect to the number of total points in the cloud.

is not the most important factor; however, it should reflect a realistic setup. In our case, we opted to place it at the top right corner when the tablet lies horizontally on the table. However, it should be stated that any other positioning of the camera would work just as well for the demonstration presented in this contribution.

#### **5.2 System Performance**

One classification step of our model takes between 1.6 × 10<sup>−5</sup> s and 3.8 × 10<sup>−5</sup> s of computation time. As Fig. 4 indicates, the time required to crop the cloud to its relevant parts depends linearly on the number of points within the cloud.

This is the main bottleneck of our approach, as all other steps in the pipeline are either constant factors or negligible with respect to the required computation time. During real-time tests, our system achieved frame rates of up to 40 fps.
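The linear dependence is what one would expect from a bounding-box crop, where a single boolean mask touches each point exactly once. A minimal sketch (not the authors' implementation; the box bounds and names are illustrative):

```python
import numpy as np

def crop_cloud(points, lower, upper):
    """Keep only points inside an axis-aligned bounding box.

    points: (N, 3) array; lower/upper: 3-vectors delimiting the box.
    A single boolean mask touches every point once, so the cost grows
    linearly with the number of points in the cloud.
    """
    mask = np.all((points >= lower) & (points <= upper), axis=1)
    return points[mask]

# Synthetic cloud at the sensor's lateral resolution (224 x 171 points).
rng = np.random.default_rng(0)
cloud = rng.uniform(-1.0, 1.0, size=(224 * 171, 3))
hand = crop_cloud(cloud,
                  lower=np.array([-0.2, -0.2, 0.0]),
                  upper=np.array([0.2, 0.2, 0.5]))
print(hand.shape[1])  # 3 coordinate columns remain after cropping
```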

### **6 Conclusion**

We presented a system for hand gesture recognition capable of running in real time on a mobile device, using a 3D sensor optimized for mobile use. Based on a small database recorded with this setup, we showed that high speed and an excellent generalization capacity are achieved by our combined preprocessing + deep RNN-LSTM approach. As the LSTM is a recurrent neural network model, it can be trained on gesture data in a straightforward fashion, requiring no segmentation of the gesture, only the assumption of a maximal duration corresponding to 40 frames. The preprocessed signals are fed into the network frame by frame, which has the additional advantage that correct classification is often achieved before the gesture is completed. This may make it possible to form an "educated guess" about the gesture being performed very early on, leading to more natural interaction, much as humans anticipate the reactions or statements of conversation partners. In this classification problem, it is easy to see why "ahead of time" recognition is possible: the gestures differ sufficiently from each other from a certain point in time onwards.
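The frame-by-frame decision logic just described can be sketched as follows; the per-frame classifier is stubbed with a toy score function, whereas the paper's actual model is a deep RNN-LSTM:

```python
# Sketch of "ahead of time" gesture classification: frames are fed to a
# per-frame classifier, and a decision is emitted as soon as one class's
# score exceeds a confidence threshold, possibly well before the
# 40-frame maximal gesture duration is reached.

def classify_early(frames, per_frame_scores, threshold=0.9, max_frames=40):
    """Return (class_index, frame_of_decision) or (None, max_frames)."""
    for t, frame in enumerate(frames[:max_frames]):
        scores = per_frame_scores(frame)           # e.g. softmax output
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            return best, t + 1                     # decided "ahead of time"
    return None, max_frames

# Toy score function: confidence for class 2 ramps up over time.
frames = list(range(40))
scores = lambda t: [0.1, 0.1, min(1.0, 0.05 * (t + 1)), 0.1]
label, decided_at = classify_early(frames, scores)
print(label, decided_at)  # class 2, decided at frame 18 of 40
```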

A weak point of our investigation is the small size of the gesture database, which is currently being constructed. While this makes the achieved accuracies somewhat less conclusive, it is nevertheless clear that the proposed approach is feasible, since multiple cross-validation runs using different train/test subdivisions always gave similar results. Future work will include performance tests on several mobile devices and corresponding optimization of the algorithms used (i.e., tuning the deep LSTM for speed rather than for accuracy), so that 3D hand gesture recognition becomes a mode of interaction accessible to the greatest possible number of mobile devices.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

### **Adjustable Autonomy for UAV Supervision Applications Through Mental Workload Assessment Techniques**

Federica Bazzano1(B) , Angelo Grimaldi<sup>1</sup>, Fabrizio Lamberti<sup>1</sup>, Gianluca Paravati<sup>1</sup>, and Marco Gaspardone<sup>2</sup>

<sup>1</sup> Dip. di Automatica e Informatica, Politecnico di Torino, Corso Duca degli Abruzzi, 24, 10129 Turin, Italy {federica.bazzano,angelo.grimaldi,fabrizio.lamberti, gianluca.paravati}@polito.it <sup>2</sup> TIM JOL Connected Robotics Applications LaB, Corso Montevecchio 71, 10129 Turin, Italy marco.gaspardone@telecomitalia.it

**Abstract.** In recent years, unmanned aerial vehicles (UAVs) have received significant attention in the research community due to their adaptability to different applications, such as surveillance, disaster response, traffic monitoring, transportation of goods, and first aid. Nowadays, even though UAVs can be equipped with some autonomous capabilities, they often operate in high-uncertainty environments in which supervisory systems that keep a human in the control loop are still required. Systems with decision-making capabilities and flexible levels of autonomy are needed to support UAV controllers in monitoring operations. The aim of this paper is to build an adjustable autonomy system able to assist UAV controllers by predicting mental workload changes when the number of UAVs to be monitored increases sharply. The proposed system adjusts its level of autonomy by discriminating situations in which operators' abilities are sufficient to perform UAV supervision tasks from situations in which system suggestions or interventions may be required. A user study was performed to create a mental-workload prediction model based on operators' cognitive demand in drone monitoring operations. This model is exploited to train the developed system to infer the appropriate level of autonomy. The study provided valuable indications that may guide future developments of the proposed adjustable autonomy system.

**Keywords:** Adjustable autonomy · Mental workload · Supervisory control · Decision-making system

Work reported has been partially funded by TIM JOL Connected Robotics Applications LaB (CRAB).

© The Author(s) 2017

P. Horain et al. (Eds.): IHCI 2017, LNCS 10688, pp. 32–44, 2017. https://doi.org/10.1007/978-3-319-72038-8\_4

### **1 Introduction**

In recent years, the field of aerial service robotics has seen rapidly growing interest in the development of Unmanned Aerial Vehicles (UAVs) equipped with some autonomous capabilities. However, since UAVs often operate in high-uncertainty, dynamic scenarios characterized by unpredictable failures and parameter disturbances, no totally autonomous control system has emerged yet [1]. Supervisory systems keeping a human in the control loop are required both to monitor UAV operations and to assist UAV controllers when critical situations occur [2,3].

Systems equipped with flexible levels of autonomy (LOAs) and decision-making capabilities in uncertain environments may be exploited to dynamically allocate human-machine functions by discriminating situations where operators' skills are sufficient to perform a given task from situations where system suggestions or interventions may be required [4–6]. The assessment of operator multitasking performance, as well as of the mental effort required to monitor UAVs, generally termed "*cognitive or mental workload*" [7], may be used to determine which LOA the system needs.

By leveraging the above considerations, this paper reports on the activities that have been carried out at Politecnico di Torino and at TIM JOL Connected Robotics Applications LaB (CRAB) to develop, through an assessment of humans' mental workload, an adjustable autonomy system equipped with some decision-making capabilities in UAV-traffic monitoring scenarios. The system, later referred to as "*control tower*", was devised to autonomously infer the appropriate level of autonomy by exploiting a mental workload prediction model built on operators' cognitive demand in monitoring a growing number of UAVs with an increasing level of risk.

A simulation framework was developed to reproduce both swarms of autonomous drones flying in a 3D virtual urban environment and the critical conditions they could be involved in. Afterwards, a user interface showing the 2D map of the city was developed to display the drones' positions and flight information and to allow human operators to monitor them and intervene when critical conditions occur. A Bayesian Network (BN) classifier was exploited in this work to build the mental workload prediction model described above. This classifier was chosen as the learning probabilistic model due to its capability to solve decision problems under uncertainty [8].

A user study was carried out with several volunteers, who were asked to perform supervision and monitoring tasks on a variable number of drones with a growing level of risk. During each experiment, participants were asked to evaluate their perceived mental workload in order to train the developed system to infer the appropriate level of autonomy.

The rest of the paper is organized as follows. In Sect. 2, relevant literature in the area of adaptive autonomy systems is reviewed. In Sect. 3, the architecture of the system proposed in this study is described. Section 4 provides an overview of the user interface exploited in this study. Section 5 introduces the methodology that has been adopted to perform the experimental tests and discusses results obtained. Lastly, Sect. 6 concludes the paper by providing possible directions for future research activities in this field.

## **2 Related Work**

Many studies in the domain of aerial robotics applications have investigated the evaluation and classification of cockpit operators' workload.

A number of studies have revealed the advantages of exploiting dynamic function allocation for managing operator workload and keeping the operator focused within the control loop [9,10]. In the literature, several criteria have been investigated to evaluate humans' cognitive load. The main measurement techniques have historically been classified into three categories: physiological, subjective, and performance-based [11]. Different techniques for mental workload assessment and classification have been proposed in this field.

Many research studies have focused on physiological measurements for assessing operator cognitive load in real time. For instance, Scerbo et al. [12] proposed EEG power band ratios as an example of workload measurement in adaptive automation. Wilson et al. [13] exploited EEG channels, electrocardiographic (ECG), electrooculographic (EOG) and respiration inputs for cognitive workload evaluation and an Artificial Neural Network (ANN) as the classification methodology. Magnusson [14] examined pilots' Heart Rate (HR), Heart Rate Variability (HRV) and eye movements in simulated and real flight.

Although these studies have provided evidence that merging more than one physiological measurement improves the accuracy of workload classification [13,15], such approaches have proved infeasible from a measurement perspective, affected by the operator's emotional state, and impractical in aircraft cockpit applications due to the need to wear several devices at the same time [16].

In parallel with these studies, other approaches were investigated that combine physiological measures with other classes of workload assessment techniques. For example, in [8] the authors evaluated operators' workload in piloting a flying aircraft by using the EEG signal together with the NASA-TLX questionnaire as a subjective measure and a Bayesian Network as the classification method. Di Nocera et al. [17] investigated the workload of operators engaged in simulated flight, employing eye fixation measures and the NASA-TLX questionnaire for assessment and the Nearest Neighbor (NN) algorithm for classification. In [16], the authors investigated different classes of cognitive workload measures by merging cardiovascular activity and secondary task performance (a performance-based technique) as inputs to an Artificial Neural Network (ANN) for operator cognitive state classification during a simulated air traffic control task.

Based on the short but representative review above, it can be observed that the panorama of mental workload assessment and classification techniques in aerial robotics applications is quite heterogeneous. By taking into account advantages and drawbacks of the above solutions, the system proposed in this paper combines subjective workload assessment techniques with a probabilistic Bayesian Network classifier to support UAV controllers in monitoring operations by autonomously inferring the appropriate LOA for the specific situation.

# **3 Proposed System**

In the following, the adjustable autonomy system is introduced, together with some implementation details.

### **3.1 Architecture Overview**

The Adjustable Autonomy System Architecture (AASA) implementing the idea inspiring the present paper is illustrated in Fig. 1. It consists of three main components: the *UAVs Simulator* (left), the *Bandwidth Simulator* (right) and the *Adjustable Autonomy Control Tower* (bottom). More specifically, the *UAVs Simulator* is the block in charge of loading the 3D urban environment and executing the 3D drone flight simulation in it. A 3D physics engine was also exploited to test different flying scenarios in conditions as close as possible to a realistic environment. The *Bandwidth Simulator* block was used to reproduce the network transmission rate across the simulated city. Since drones communicate and send information through the network, a low-bandwidth connection could lead to critical conditions for UAV controllers. The *Adjustable Autonomy Control Tower* hosts the *Alert* and *Decision* modules. The former determines the state of each drone by mapping the set of information collected by the *UAVs* and *Bandwidth Simulators* (i.e., the drones' battery level and their distance from obstacles) onto different levels of risk, later referred to as *"Alert"*. Three levels are used to discriminate a drone's level of risk, namely *"Safe"*, *"Warning"* and *"Danger"*. The latter module is responsible for establishing the appropriate level of autonomy by elaborating both the operator's mental workload and his/her performance via the *"Alert"* level of each drone.

**Fig. 1.** Adjustable autonomy system architecture.

#### **3.2 UAVs Simulator**

The *UAVs Simulator* is the module responsible for performing the 3D drones' simulation in an urban environment. It consists of three different modules namely *Autopilot*, *Physics Simulation* and *Ground Control Station (GCS)*.

The *Autopilot* module contains the flight software that allows drones to fly stably. More specifically, the Software-In-The-Loop (SITL)<sup>1</sup> simulator was exploited to run the UAV flight code without any specific hardware. Within this simulation tool, the autopilot code that normally runs on the drone's onboard computer is compiled, simulated and run by the SITL simulation software itself. In this specific case, the SITL software was used to run the PX4 Autopilot Flightcode<sup>2</sup>, an open source UAV firmware supporting a wide range of vehicle types.

The *Physics Simulation* module is responsible for replicating the real-world physics of the drones' flight. In this work, Gazebo<sup>3</sup> was exploited as a real-time physics engine in order to emulate the 3D models of the UAVs, their physical properties and constraints, and their sensors (e.g., laser, camera) in a 3D urban environment. Gazebo runs on the Robot Operating System (ROS)<sup>4</sup>, a software framework developed for performing robotics tasks.

The *Ground Control Station (GCS)* module contains the software needed to setup drones' starting GPS locations, get real-time flight information, plan and execute drones' missions. The communication between the PX4 Autopilot Flightcode and the GCS module is provided by the Micro Air Vehicle ROS (MAVROS) node with the MAVLink communication protocol. As illustrated in Fig. 1, MAVProxy node acts as an intermediary between the GCS and UAVs supporting MAVLink protocol.

Lastly, as illustrated in Fig. 1, this module provides UAV information data to the Adjustable Autonomy Module by means of the RosBridge Protocol<sup>5</sup>. More specifically, this information, regarding the drones' battery level (later abbreviated *b*) and their distance from obstacles such as buildings (later abbreviated *o*), is gathered by the Alert Module to determine the status of each drone.

#### **3.3 Bandwidth Simulator**

In this work, the network transmission rate was assumed to depend on two different variables: population density of the city sites (parks, stadiums, schools, etc.) and the network coverage. Three different values, in the range [1;3] - where 1 is *"Low"*, 2 is *"Medium"* and 3 is *"High"* - were used to describe the population density and network coverage levels of the city according to daily time slots and OpenSignal<sup>6</sup> data respectively. A grid on the map was created by storing in

<sup>1</sup> http://ardupilot.org/dev/docs/sitl-simulator-software-in-the-loop.html.

<sup>2</sup> https://px4.io.

<sup>3</sup> https://gazebosim.org.

<sup>4</sup> https://www.ros.org.

<sup>5</sup> https://wiki.ros.org/rosbridge suite.

<sup>6</sup> https://opensignal.com.

each cell the population density and coverage values described above in order to calculate the bandwidth in the considered area. The resulting transmission rate for each cell was computed by thresholding a linear polynomial function *y* of the above values as follows:

$$Bandwidth = \begin{cases} High & \text{if } y < 0.5\\ Medium & \text{if } 0.5 \le y < 1.5\\ Low & \text{if } y \ge 1.5 \end{cases}$$

As illustrated in Fig. 1, the three different calculated bandwidth levels (later abbreviated *n*) are sent to the Adjustable Autonomy Module in order to determine the transmission rate around the drone's position on the map.
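The thresholding just described can be sketched as follows; since the paper does not give the actual linear polynomial, the default `y_fn` below (a simple linear combination of density and coverage) is purely an illustrative assumption:

```python
def bandwidth_level(density, coverage, y_fn=lambda d, c: 1 + 0.5 * (d - c)):
    """Map a cell's population density and network coverage (both 1..3,
    where 1 is Low and 3 is High) to a bandwidth level.

    The paper only states that a linear polynomial y of the two values
    is thresholded; the default y_fn here is an illustrative assumption
    that reproduces the intuitive ordering (low density plus high
    coverage -> high bandwidth).
    """
    y = y_fn(density, coverage)
    if y < 0.5:
        return "High"
    elif y < 1.5:
        return "Medium"
    return "Low"

print(bandwidth_level(1, 3))  # empty area, full coverage -> "High"
```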

#### **3.4 Adjustable Autonomy Control Tower**

The Adjustable Autonomy Control Tower consists of two submodules namely: *Alert Module* and *Decision Module*.

The *Alert Module*, as illustrated in Fig. 1, receives data from the *UAVs* and *Bandwidth Simulators* as inputs. Each input is associated with one of three values, namely *"High"*, *"Medium"* and *"Low"*, according to Table 1, and each value is matched with a number in the range [1;3] - where 1 is *"Low"* and 3 is *"High"*.


**Table 1.** Drones' information association to variables

The mathematical formula described in (1) was exploited to compute the *Alert*:

$$y = \frac{1}{b-1} \ast \frac{1}{o-1} \ast \frac{1}{n-1} \tag{1}$$

where *b*, *o*, *n*, represent the three inputs listed in Table 1 and *y* represents the drone's level of risk. Thus, the resulting *Alert* was calculated as follows:

$$Alert = \begin{cases} Danger & \text{if } b = 1 \ \lor\ o = 1 \ \lor\ n = 1\\ Warning & \text{if } 0.15 < y < 1.5\\ Safe & \text{if } y \le 0.15 \end{cases}$$

It can be observed from (1) that when one of the input variables is *"Low"*, the *Alert* assumes the *"Danger"* value. As the input variables' values increase, the *Alert* decreases from *"Danger"* to *"Safe"* through the *"Warning"* level.
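A minimal sketch of the *Alert* computation combining Eq. (1) with the thresholds above; since *y* shrinks as the inputs improve, small *y* is treated as *"Safe"* here, in line with the statement that higher input values move the alert towards *"Safe"*:

```python
def alert(b, o, n):
    """Drone risk level from battery (b), distance to obstacles (o) and
    bandwidth (n), each pre-mapped to 1 (Low) .. 3 (High).

    Any Low input forces "Danger"; otherwise y from Eq. (1) is
    thresholded: intermediate values give "Warning", small values
    (all inputs High) give "Safe".
    """
    if b == 1 or o == 1 or n == 1:
        return "Danger"
    y = 1 / ((b - 1) * (o - 1) * (n - 1))   # Eq. (1)
    if 0.15 < y < 1.5:
        return "Warning"
    return "Safe"

print(alert(1, 3, 3), alert(2, 2, 2), alert(3, 3, 3))
# Danger Warning Safe
```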

The *Decision Module* represents the core of the devised architecture. It is responsible for inferring the appropriate level of autonomy by elaborating both the operators' mental workload and the mission outcomes via the number of UAVs in each *"Alert"* state.

A Bayesian Network (BN) classifier, a probabilistic model learned from data, was selected to represent all the variables involved in the study and their relationships, in order to infer conclusions when some variables are observed. The structure of this model, where the estimated LOA of the system is a direct child of the mission outcome node via the workload node, is illustrated in Fig. 2. The probability of changes in operators' workload is conditioned on changes in the number of drones in the *"Alert"* state; thus, the probability of successfully completing missions is influenced by operators' cognitive workload.

The LOAs proposed in this work were "*Warning*", "*Suggestion*" and "*Autonomous*", where the system respectively warns the operator if critical situations occur, suggests feasible actions to him/her, or monitors and performs actions autonomously without any human intervention.

**Fig. 2.** Bayesian Network model inferring the LOA from drone mission outcomes, and thus from subjective mental workload features, via the number of UAVs in each *"Alert"* state.
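The chain described above (drones in *"Alert"* state → workload → mission outcome → LOA) can be sketched as naive enumeration over conditional probability tables. All the numbers below are illustrative assumptions, not the probabilities learned by the authors' model:

```python
# Hand-rolled inference over the chain alert-load -> workload -> outcome -> LOA.
# Every CPT entry is a made-up example value.

P_workload = {          # P(workload | number of drones in "Alert")
    "low":  {"Low": 0.8, "Medium": 0.15, "High": 0.05},
    "high": {"Low": 0.1, "Medium": 0.3,  "High": 0.6},
}
P_success = {           # P(mission success | workload)
    "Low": 0.9, "Medium": 0.7, "High": 0.2,
}
P_loa = {               # P(LOA | mission outcome)
    True:  {"Warning": 0.8,  "Suggestion": 0.15, "Autonomous": 0.05},
    False: {"Warning": 0.05, "Suggestion": 0.25, "Autonomous": 0.7},
}

def loa_posterior(alert_load):
    """P(LOA | alert_load), summing out workload and mission outcome."""
    post = {"Warning": 0.0, "Suggestion": 0.0, "Autonomous": 0.0}
    for wl, p_wl in P_workload[alert_load].items():
        for success in (True, False):
            p_out = P_success[wl] if success else 1 - P_success[wl]
            for loa, p in P_loa[success].items():
                post[loa] += p_wl * p_out * p
    return post

post = loa_posterior("high")
print(max(post, key=post.get))  # "Autonomous" wins under these toy numbers
```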

## **4 User Interface**

In this section, the user interface showing the 2D map of the city, which displays the drones' positions and useful information for the human operator, is presented. The devised interface allows the human operator to take control of drones through different flight commands. Depending on the current LOA of the system, the number and type of flight commands displayed change dynamically, thus defining the "*Warning*" or "*Suggestion*" interface.

A wide region of the operator's display is covered by the 2D map of the city, on which drones are shown in real time. A colored marker on the map indicates both a drone's GPS position and its current *"Alert"* level (Fig. 3a). Three different colors are used to depict the drone's level of risk: green (*"Safe"*), yellow

**Fig. 3.** Warning interface (a), UAVs data summary (b), flight commands in Suggestion interface (c) and control and display information buttons (d). (Color figure online)

(*"Warning"*) and red (*"Danger"*). A drone's marker color changes from green to red according to the linear interpolation described in (1). An extensive visual summary of the data about each drone is shown in the panel on the right side of the interface (Fig. 3b). For each drone, it reports the unique name, the battery level, the bandwidth coverage of the area around its location and its flying altitude. Five control buttons, by which the operator can either issue flight commands or show information about the map or the UAVs, are placed right below the map (Fig. 3d). The "*Start*" button is used to run the UAV simulation, whereas the "*Options*" button is used to show or hide the bandwidth coverage grid of the city and the drones' paths. The other three buttons, namely "*Land*", "*Hovering*" and "*Change Path*", are only available in the "*Warning*" interface and are used by the human operator to take direct control of a drone. In this modality, the UAV controller can land, hover or change the drone's assigned path by defining the next waypoint with respect to the drone's current position. On the contrary, in the "*Suggestion*" interface, the operator can only select actions among those suggested by the system in the summary panel on the right of the interface (Fig. 3c), according to Table 2. The replanning action implemented in this work provides an alternative path from the current position of the drone to its target location by exploiting the Bing Maps REST API<sup>7</sup> with a route planning request.

### **5 Experimental Results**

As anticipated, the goal of this paper is to build an adjustable autonomy system with decision-making capabilities able to assist control tower operators by

<sup>7</sup> https://msdn.microsoft.com/it-it/library/ff701713.aspx.


**Table 2.** System suggested actions for each drone.

predicting mental workload changes or overload when the number of UAVs to be monitored increases sharply. To this aim, a BN probabilistic classifier was defined in this work to learn, from data collected through a user study, how to infer the appropriate level of autonomy in drone-traffic-control tasks. The participants in the study (6 males and 2 females, aged between 24 and 27) were selected from the students of Politecnico di Torino in order to gather the data needed to develop a first prototype of the system. A preliminary experiment with 4 participants was conducted to establish a prior subdivision of the number of drones into three ranges, namely "*Low*", "*Medium*", and "*High*". To this end, participants were invited to monitor from 1 to 6 UAVs characterized by a level of risk linearly proportional to the number of drones. The results showed that the "*Low*", "*Medium*" and "*High*" ranges correspond to 1, 2, and 3 or more UAVs, respectively.

Afterwards, a brief training phase was performed to instruct participants to act as real UAV controllers by performing supervision and monitoring tasks over a growing number of drones. They were invited to monitor and, when necessary, intervene in the drones' behavior by exploiting the flight commands shown in the user interface whenever critical conditions were signaled by the UAVs through an alert.

The experiment was organized in six sessions (1 practice and 5 tests) of two trials each, one in "*Warning*" mode and the other in "*Suggestion*" mode, using the related interface. The two modalities were presented in random order so as to limit learning effects. Each trial lasted approximately 4 min.

The first test (labeled T1) consisted of a single flying drone whose path was designed to avoid obstacles on its route. Tests T2 and T3 were meant to evaluate the operator's performance in monitoring two drones flying in a medium-bandwidth zone and two drones at risk of colliding, respectively. The fourth test (T4) consisted of three drones, two of which were at high risk of colliding and one with a medium battery level. Test T5 consisted of five drones, three of which were at high risk of colliding. Lastly, T6 consisted of six drones, each of which required the operator's intervention to successfully complete the mission. The outcome of each test may be *"successfully completed"* - if all drones land correctly in the intended positions - or *"failed"* - if at least one drone crashes. These tasks were specifically designed to test the operator's performance in the possible air-traffic-management scenarios he/she could be involved in.

During each trial, quantitative data about the number of unmanaged drones (and thus the outcome of each mission), as well as information about the *"Alert"* status of each drone, were recorded. At the end of each trial, participants were asked to fill in a NASA Task Load Index (TLX) questionnaire [18] for each action performed on the drones. This questionnaire was exploited to evaluate operators' self-assessed workload on six dimensions: *mental demand*, *physical demand*, *temporal demand*, *performance*, *effort*, and *frustration*, each scored from 0 to 100. A global score is then calculated by a weighting procedure that combines the six individual scale ratings. At the end of each session (after two trials), participants were also asked to indicate which LOA of the system they preferred for performing the test. For each participant, the execution of the tests and the compilation of the questionnaires took about 2 h.
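The TLX weighting procedure can be sketched as follows: each subscale rating is weighted by the number of the 15 pairwise comparisons won by that dimension, and the weighted sum is divided by 15. The ratings and weights below are made-up example values:

```python
SCALES = ("mental", "physical", "temporal", "performance", "effort",
          "frustration")

def tlx_global(ratings, weights):
    """NASA-TLX global score: weighted mean of six 0-100 ratings,
    with weights from the 15 pairwise comparisons."""
    assert sum(weights.values()) == 15, "15 pairwise comparisons in total"
    return sum(ratings[s] * weights[s] for s in SCALES) / 15.0

# Example values for one trial (illustrative, not measured data).
ratings = {"mental": 80, "physical": 20, "temporal": 60,
           "performance": 40, "effort": 70, "frustration": 50}
weights = {"mental": 5, "physical": 1, "temporal": 3,
           "performance": 2, "effort": 3, "frustration": 1}
print(tlx_global(ratings, weights))  # ≈ 62.7
```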

**Fig. 4.** Results in terms of (a) percentage of participants able to successfully complete the missions and (b) NASA-TLX average score in the considered missions.

The number of completed missions and the percentage of participants able to complete them are reported in Fig. 4a, whereas the average values of the operators' perceived workload scores are illustrated in Fig. 4b. It can be observed that the percentage of participants able to complete mission T1 is significantly greater than for missions T5 and T6. Concerning operators' self-assessed mental workload, the NASA-TLX average score of mission T6 appeared considerably higher than the others. From these findings, operators' mental workload in managing 1, 2, or 3 and more UAVs may be labeled as "*Low*", "*Medium*" and "*High*" workload, respectively. These findings corroborate the preliminary results obtained above, confirming the previous subdivision into three ranges according to the number of drones.

The results obtained were then exploited to train the Bayesian Network classifier to determine the appropriate level of autonomy for the system. The model was then evaluated in terms of accuracy. For this purpose, a cross-validation technique was used to test the classification performance and the model's ability to predict LOAs on unseen data. Following this methodology, the collected data were divided into a *training set* - for training the BN - and a *validation set* - for accuracy validation - containing 80% and 20% of the data, respectively. The overall data set contains as many rows as the actions carried out by participants on drones. Each row consists of the number of UAVs in the three *"Alert"* states, the operator's mental workload level, the outcome of the mission and his/her preferred LOA in that situation. An example of a test result is shown in Table 3, and the corresponding row used to build the *training* and *validation sets* is shown in Table 4. The Bayesian Network training phase was performed with the Netica software<sup>8</sup>; the validation yielded a LOA classification accuracy of 83.44%. Table 5 shows the confusion matrix for each level of autonomy considered in this study.
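The 80/20 subdivision and the accuracy measure can be sketched as follows (the row contents and helper names are illustrative):

```python
import random

def split_80_20(rows, seed=42):
    """Shuffle the collected rows and split them 80% / 20% into
    training and validation sets, as done for the BN evaluation."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(0.8 * len(rows))
    return rows[:cut], rows[cut:]

def accuracy(predicted, actual):
    """Fraction of validation rows whose LOA was predicted correctly."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)

# Hypothetical rows: (UAVs per "Alert" state, workload, outcome, LOA).
rows = [((2, 1, 0), "Medium", True, "Suggestion")] * 100
train, val = split_80_20(rows)
print(len(train), len(val))  # 80 20
```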


**Table 3.** Example of a test result with 3 UAVs.



**Table 5.** Confusion matrix


# **6 Conclusions and Future Work**

In this work, an adjustable autonomy system with decision-making capabilities was developed to assist UAV operators by predicting the appropriate LOA based on operators' mental workload measurements in drone monitoring scenarios. A Bayesian Network (BN) classifier was exploited as the learning probabilistic model and the NASA-TLX questionnaire as the subjective workload assessment technique. The results show that the proposed model is able to predict the appropriate LOA with an accuracy of 83.44%. Future work will focus on alternative workload assessment techniques, such as physiological measurements, to capture cognitive information in real time and continuously, with higher reliability in the measurements.

<sup>8</sup> https://www.norsys.com.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

### Brain Computer Interfaces

## **Classification of Motor Imagery Based EEG Signals Using Sparsity Approach**

S. R. Sreeja1(B), Joytirmoy Rabha<sup>1</sup>, Debasis Samanta<sup>1</sup>, Pabitra Mitra<sup>1</sup>, and Monalisa Sarma<sup>2</sup>

<sup>1</sup> Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur, West Bengal, India sreejasr@iitkgp.ac.in, joydan4123@gmail.com, dsamanta@sit.iitkgp.ernet.in, pabitra@cse.iitkgp.ernet.in <sup>2</sup> Subir Chowdhury School of Quality and Reliability, Indian Institute of Technology Kharagpur, Kharagpur, West Bengal, India monalisa@iitkgp.ac.in

**Abstract.** Advances in brain-computer interface systems (BCIs) give new hope to people with special needs to restore their independence. Since BCIs using motor imagery (MI) rhythms provide a high degree of freedom, they have been used for many real-time applications, especially for locked-in people. Available BCIs using MI-based EEG signals usually make use of spatial filtering and powerful classification methods to attain better accuracy and performance. Inter-subject variability and classifier speed are still issues in MI-based BCIs. To address these issues, in this work we propose a new classification method, the spatial filtering based sparsity (SFS) approach, for MI-based BCIs. The proposed method makes use of the common spatial pattern (CSP) to spatially filter the MI signals. Frequency bandpower and wavelet features extracted from the spatially filtered signals are then used to build two different over-complete dictionary matrices. This dictionary matrix helps to overcome the issue of inter-subject variability. Later, sparse representation based classification is carried out to classify the two-class MI signals. We analysed the performance of the proposed approach using the publicly available MI dataset IVa from BCI competition III. The proposed SFS method provides better classification accuracy and runtime than the well-known support vector machine (SVM) and logistic regression (LR) classification methods. The SFS method can further be used to develop real-time applications for people with special needs.

**Keywords:** Electroencephalography (EEG) · Brain-computer interface (BCI) · Motor imagery (MI) · Sparsity based classification · BCI for motor impaired users

### **1 Introduction**

Brain-computer interface (BCI) systems provide a direct connection between the human brain and a computer [20]. BCIs capture neural activities associated with external stimuli or mental tasks, without any involvement of nerves and muscles, and thus provide an alternative non-muscular communication channel [21]. The interpreted brain activities are directly translated into sequences of commands to carry out specific tasks such as controlling wheelchairs, home appliances, robotic arms, speech synthesizers, computers and gaming applications. Although brain activities can be measured with non-invasive devices such as functional magnetic resonance imaging (fMRI) or magnetoencephalography (MEG), most common BCIs are based on the electroencephalogram (EEG). EEG-based BCIs facilitate many real-time applications due to their affordable cost and ease of use [18].

EEG-based BCI systems are mostly built using visually evoked potentials (VEPs), event-related potentials (ERPs), slow cortical potentials (SCPs) and sensorimotor rhythms (SMR). Among these, SMR-based BCIs provide a high degree of freedom in association with real and imaginary movements of hands, arms, feet and tongue [10]. The neural activities associated with SMR-based motor imagery (MI) BCIs are the so-called mu (7–13 Hz) and beta (13–30 Hz) rhythms [16]. These rhythms are readily measurable in both healthy people and people disabled by neuromuscular injuries. Executing real or imaginary motor movements causes amplitude suppression or enhancement of the mu rhythm; these phenomena are called event-related desynchronization (ERD) and event-related synchronization (ERS), respectively [16].

Available MI-based BCI systems make use of spatial filtering and powerful classification methods such as support vector machines (SVM) [17,18], logistic regression (LR) [13] and linear discriminant analysis (LDA) [3] to attain good accuracy. These classifiers are computationally expensive and introduce delays into the BCI system. For real-time BCI applications, ongoing MI events have to be detected and classified into control commands as accurately and quickly as possible; otherwise BCI users, especially motor impaired people, may become irritated and bored. Moreover, for the same user, the observed MI patterns differ from one day to another, or from session to session [15]. This variability of EEG signals also degrades classifier performance. These issues motivate us to design an MI-based BCI system with enhanced accuracy and speed, robust to inter-subject variations, for people with special needs.

With this purpose in mind, we propose in this paper a new spatial filtering based sparsity (SFS) approach to classify MI-based EEG signals for BCIs. In recent years, sparsity based classification has received a great deal of attention in image recognition [22] and speech recognition [9]. The idea of sparsity underlies compressive sensing (CS), whose theory states that any natural signal can be represented sparsely under certain conditions [5,8]. Given a signal and an over-complete dictionary matrix, the objective of sparse representation is to compute the sparse coefficients so that the signal can be represented as a sparse linear combination of atoms (columns) of the dictionary [14]. If the dictionary matrix is designed from the best extracted features of the MI signal, it helps to overcome the issue of inter-personal and intra-personal variability, and also enhances the processing speed and accuracy of the classifier.

**Fig. 1.** Framework of the proposed SFS system.

The framework of the proposed system is shown in Fig. 1. In our proposed method, out of the 10–20 international system of EEG electrode placement, we consider only the few channels located over the motor areas for further processing. The selected channels of EEG data are then passed through band-pass filters of 7–13 Hz and 13–30 Hz, as it is known from the literature that most MI activity lies within these frequency ranges. CSP is then applied to spatially filter the signals, and features obtained from the filtered signals are used to build the columns (atoms) of a dictionary matrix. This is an important phase in the proposed approach, responsible for removing inter-personal variability and enhancing classification accuracy. Finally, sparsity based classification is carried out to discriminate the patterns of the two MI classes. The SFS method provides better accuracy and speed than conventional support vector machine (SVM) and logistic regression (LR) classifier models.

Our paper is organised as follows. Section 2 describes the data and the proposed technique in detail. Section 3 presents the experimental results and performance evaluation. Finally, conclusions and future work are outlined in Sect. 4.

### **2 Data and Method**

This section describes the MI data used in this research and the pipeline followed in the proposed method, that is, channel selection, pre-processing and spatial filtering based sparsity (SFS) classification of EEG-based MI data.

#### **2.1 Dataset Description**

We used the publicly available dataset IVa from BCI competition III<sup>1</sup> to validate the proposed approach. The dataset consists of EEG data recorded from five healthy subjects (aa, al, av, aw, ay) who performed right-hand and right-foot MI tasks during each trial. MI signals were recorded from 118 channels placed according to the international 10–20 system. For each subject, there were 140 trials for each

<sup>1</sup> http://www.bbci.de/competition/iii.

task, and therefore 280 trials in total. The measured EEG signal was band-pass filtered between 0.05–200 Hz, digitized at 1000 Hz with 16 bit accuracy, and downsampled to 100 Hz for further processing.

#### **2.2 Channel Selection and Preprocessing**

The dataset consists of EEG recordings from 118 channels, which is too large to process in full. As we are using the EEG signals of two MI task classes (right-hand and right-foot), we extract the needed information from the premotor cortex, supplementary motor cortex and primary motor cortex [11]. Therefore, out of the 118 recorded channels, the 30 channels over the motor cortex are considered for further processing. Moreover, removal of irrelevant channels helps to increase the robustness of the classification system [19]. The selected channels are FC2, FC4, FC6, CFC2, CFC4, CFC6, C2, C4, C6, CCP2, CCP4, CCP6, CP2, CP4, CP6, FC5, FC3, FC1, CFC5, CFC3, CFC1, C5, C3, C1, CCP5, CCP3, CCP1, CP5, CP3 and CP1. The motor cortex and the areas of motor functions, the standard 10–20 system of electrode placement of the 128 channel EEG system, and the electrodes selected for processing are shown in Fig. 2. The green and red circles indicate the selected channels, with the red circles marking the C3 and C4 channels on the left and right sides of the scalp, respectively.

**Fig. 2.** (a) Motor cortex of the brain (b) Standard 10–20 system of electrode placement for a 128 channel EEG system. The electrodes in green and red are selected for processing (c) The anterior view of the scalp and the selected channels. (Color figure online)

From domain knowledge we know that most brain activity related to motor imagery lies within the frequency band of 7–30 Hz [16]. A bandpass filter can be used to extract a particular frequency band and also helps to filter out most of the high-frequency noise; it can have as many sub-bands as needed [12]. We experimented with two sub-bands, 7–13 Hz and 13–30 Hz, for the two-class MI signal classification problem, since the mu (μ) and beta (β) rhythms reside within those frequency bands. Data segmentation is then performed, taking the two seconds of samples after the display of the cue in each trial. Each segment is called an *epoch*.
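The sub-band filtering and epoching steps above can be sketched as follows. This is an illustrative sketch with hypothetical data and cue positions, using a zero-phase Butterworth filter (the paper does not specify the filter type); the channel count and 2 s epoch length follow the text.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 100  # sampling rate after downsampling (Hz)

def bandpass(data, low, high, fs=FS, order=4):
    """Zero-phase Butterworth band-pass filter applied along the sample axis."""
    b, a = butter(order, [low / (fs / 2.0), high / (fs / 2.0)], btype='band')
    return filtfilt(b, a, data, axis=-1)

# Hypothetical continuous recording: 30 selected channels x samples.
eeg = np.random.randn(30, 30000)

# The two sub-bands used in the paper (mu and beta rhythms).
mu_band = bandpass(eeg, 7.0, 13.0)
beta_band = bandpass(eeg, 13.0, 30.0)

# Epoching: two seconds (200 samples at 100 Hz) after each cue onset.
cue_onsets = [1000, 4000, 7000]   # hypothetical cue sample indices
epoch_len = 2 * FS
epochs_mu = np.stack([mu_band[:, t:t + epoch_len] for t in cue_onsets])
print(epochs_mu.shape)  # (3, 30, 200): trials x channels x samples
```

Each (30 × 200) epoch is then passed on to the CSP stage described in Sect. 2.3.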

#### **2.3 Proposed Spatial Filtering Based Sparsity Approach**

The proposed spatial filtering based sparsity (SFS) approach consists of three steps: CSP filtering, design of the dictionary matrix, and sparsity based classification. A detailed explanation of each step is given below.

**CSP Filtering:** CSP is widely applied to binary classification problems, as it increases the variance of one class while reducing the variance of the other [1]. Here we briefly explain how CSP filtering is applied to the given two-class MI-based EEG dataset. Let **X**<sub>1</sub> and **X**<sub>2</sub> be two epochs of a multivariate signal related to the right-hand and right-foot MI classes, respectively. Both are of size (c × n), where c is the number of channels (30) and n is the number of samples (100 × 2). We define the CSP filtering as

$$\mathbf{X}_i^{CSP} = \mathbf{W}^T \mathbf{X}_i \tag{1}$$

where i indexes the MI classes, **X**<sub>i</sub><sup>CSP</sup> is the spatially filtered signal, **W** is the spatial filter matrix and **X**<sub>i</sub> ∈ ℝ<sup>c×n</sup> is the input signal to the spatial filter. The objective of the CSP algorithm is to estimate the filter matrix **W**. This is achieved by finding the vectors **w**, the columns of the spatial filter **W**, that satisfy the following optimization problem:

$$\max_{\mathbf{w}} \left( \frac{\mathbf{w}^T C_1 \mathbf{w}}{\mathbf{w}^T C_2 \mathbf{w}} \right) \tag{2}$$

where C<sub>1</sub> = **X**<sub>1</sub>**X**<sub>1</sub><sup>T</sup> and C<sub>2</sub> = **X**<sub>2</sub>**X**<sub>2</sub><sup>T</sup>. To simplify the computation of **w**, we computed **X**<sub>1</sub> and **X**<sub>2</sub> by averaging all epochs of each class. Solving the above problem with the Lagrangian method yields:

$$C\_1 \mathbf{w} = \lambda C\_2 \mathbf{w} \tag{3}$$

Thus Eq. (2) becomes a **generalized eigenvalue problem**, where λ is the eigenvalue corresponding to the eigenvector **w**. Here, **w** maximizes the variance of the right-hand class while minimizing the variance of the right-foot class. The eigenvectors with the largest eigenvalues for C<sub>1</sub> have the smallest eigenvalues for C<sub>2</sub>. Since we use 30 EEG channels, we obtain 30 eigenvalues and eigenvectors, so the CSP spatial filter **W** has 30 column vectors. From these, we select the first m and last m columns as the 2m CSP filters of **W**<sub>CSP</sub>:

$$\mathbf{W}_{CSP} = [\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_m, \mathbf{w}_{c-m+1}, \dots, \mathbf{w}_c] \in \mathbb{R}^{c \times 2m} \tag{4}$$

Therefore, for the given two-class epochs of MI data, the CSP filtered signals are defined as follows:

$$\begin{aligned} \mathbf{X}_1^{CSP} &:= \mathbf{W}_{CSP}^T \mathbf{X}_1 \in \mathbb{R}^{2m \times n} \\ \mathbf{X}_2^{CSP} &:= \mathbf{W}_{CSP}^T \mathbf{X}_2 \in \mathbb{R}^{2m \times n} \end{aligned} \tag{5}$$

The above CSP filtering is applied separately to the signals filtered in the 7–13 Hz and 13–30 Hz sub-bands.
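A minimal sketch of the CSP estimation described above, assuming random stand-ins for the class-average epochs. It uses SciPy's generalized eigensolver on the equivalent pair (C<sub>1</sub>, C<sub>1</sub> + C<sub>2</sub>), whose eigenvalue ordering matches that of Eq. (3), so the same first-m and last-m selection applies.

```python
import numpy as np
from scipy.linalg import eigh

def csp_filters(X1, X2, m=3):
    """Estimate CSP filters from class-average epochs X1, X2 (channels x samples).

    Solves the generalized eigenproblem C1 w = mu (C1 + C2) w, whose eigenvalue
    ranking is the same as for C1 w = lambda C2 w, and keeps the m eigenvectors
    with the largest and the m with the smallest eigenvalues.
    """
    C1 = X1 @ X1.T
    C2 = X2 @ X2.T
    vals, vecs = eigh(C1, C1 + C2)          # eigenvalues in ascending order
    order = np.argsort(vals)[::-1]          # reorder to descending
    vecs = vecs[:, order]
    return np.hstack([vecs[:, :m], vecs[:, -m:]])  # shape (c, 2m)

rng = np.random.default_rng(0)
X1 = rng.standard_normal((30, 200))  # stand-in average right-hand epoch
X2 = rng.standard_normal((30, 200))  # stand-in average right-foot epoch
W = csp_filters(X1, X2, m=3)
print(W.shape)            # (30, 6)
print((W.T @ X1).shape)   # (6, 200): spatially filtered signal, Eq. (5)
```

With m = 3 this yields 2m = 6 spatially filtered rows per epoch, matching the 2m-dimensional feature vectors used for the dictionary below.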

**Designing a Dictionary Matrix:** The spatially filtered signals **X**<sub>1</sub><sup>CSP</sup> and **X**<sub>2</sub><sup>CSP</sup> are obtained for each epoch and each sub-band. These spatially filtered signals are the training signals in our experiment. Let N be the total number of training signals for each MI class i and each sub-band, with i = 1 for the right-hand and i = 2 for the right-foot class. A dictionary matrix can be designed with one type of feature or a combination of different features. In this work, we designed two dictionary matrices, one using frequency bandpower and the other using wavelet transform energy as the feature of each training signal. Initially, we experimented with many features, such as statistical, frequency-domain and wavelet-domain features, entropy and auto-regressive coefficients, but we found that bandpower and wavelet energy discriminate the two classes well when plotted over the scalp. Figure 3 shows the spatial representation of bandpower and wavelet energy for the two MI classes. Figure 3(a) shows that the bandpower of right-hand MI is scattered over the scalp, while for right-foot MI it is high in the frontal region. Similarly, in Fig. 3(b) the wavelet energy is distributed over the whole scalp for right-hand MI and concentrated in a particular region for right-foot MI. Hence, these features are good enough to discriminate the two MI classes.

**Fig. 3.** Scalp plot of (a) bandpower of right-hand and right-foot MI respectively and (b) wavelet energy for right-hand and right-foot MI respectively.

From each row of a training signal, the second moment (the frequency bandpower) and the wavelet energy using the 'coif1' wavelet are calculated. The feature vectors of the training signals form the dictionary matrix, and concatenating the dictionary matrices of the two classes forms an over-complete dictionary. Since this dictionary matrix includes all the possible characteristics of the MI signals of the subjects, inter-subject variability can be avoided. Figure 4 shows the dictionary constructed for the proposed approach. The dictionary matrix is defined as **D** := [**D**<sub>1</sub>; **D**<sub>2</sub>], where **D**<sub>i</sub> = [d<sub>i,1</sub>, d<sub>i,2</sub>, d<sub>i,3</sub>, ..., d<sub>i,N</sub>]. Each atom or column of the dictionary matrix is defined as d<sub>i,j</sub> ∈ ℝ<sup>2m×1</sup>, j = 1, 2, ..., N, having 2m features. The dimension of the dictionary matrix **D** using bandpower as feature is thus 2m × 4N, denoted **D**<sub>BP</sub>; the dictionary using wavelet energy has the same dimension and is denoted **D**<sub>WE</sub>.

**Fig. 4.** Two-class dictionary designed for our proposed SFS approach. Each atom in the dictionary is obtained from the training signal of each class and each sub-band.
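The bandpower branch of the dictionary construction might be sketched as follows, with random stand-ins for the CSP-filtered training epochs. The wavelet-energy dictionary would be built analogously, replacing the second moment with the energies of a 'coif1' wavelet decomposition.

```python
import numpy as np

def bandpower_features(X_csp):
    """Feature vector: second moment (mean power) of each CSP-filtered row."""
    return np.mean(X_csp ** 2, axis=1)        # shape (2m,)

def build_dictionary(training_signals):
    """Stack one feature column (atom) per training signal, unit-normalized."""
    atoms = np.column_stack([bandpower_features(X) for X in training_signals])
    # OMP assumes unit-norm atoms, so normalize each column.
    atoms /= np.linalg.norm(atoms, axis=0, keepdims=True)
    return atoms

rng = np.random.default_rng(1)
# Hypothetical CSP-filtered training epochs (2m = 6 rows each) per class.
class1 = [rng.standard_normal((6, 200)) for _ in range(40)]
class2 = [rng.standard_normal((6, 200)) for _ in range(40)]
D = np.hstack([build_dictionary(class1), build_dictionary(class2)])
print(D.shape)  # (6, 80): 2m features x total number of atoms
```

Concatenating the per-class blocks column-wise mirrors the **D** := [**D**<sub>1</sub>; **D**<sub>2</sub>] construction in the text.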

**Sparse Representation:** After constructing the dictionary matrix, we have a linear system of equations that gives the sparse representation of an input test signal. The test signal is first converted into a feature vector **y** ∈ ℝ<sup>2m×1</sup> in the same way as the columns of dictionary **D** are generated. The input vector can then be represented as a linear combination of a few columns of **D**:

$$\mathbf{y} = \sum_{i} s_{i,1} \mathbf{d}_{i,1} + s_{i,2} \mathbf{d}_{i,2} + \dots + s_{i,N} \mathbf{d}_{i,N} \tag{6}$$

where s<sub>i,j</sub> ∈ ℝ, j = 1, 2, ..., N are the sparse coefficients and i = (1, 2) for the two MI classes. In matrix form this reads:

$$\mathbf{y} = \mathbf{D}\mathbf{s}\tag{7}$$

where **s** = [s<sub>i,1</sub>, s<sub>i,2</sub>, ..., s<sub>i,N</sub>]<sup>T</sup>. The objective of sparse representation is to estimate these coefficients so that the test signal is represented as a linear combination of a few atoms of dictionary **D** [14]. The sparse representation of an input signal **y** can be obtained by the following ℓ<sub>0</sub> norm minimization:

$$\min_{\mathbf{s}} \left\| \mathbf{s} \right\|_{0} \quad \text{subject to} \quad \mathbf{y} = \mathbf{Ds} \tag{8}$$

ℓ<sub>0</sub> norm optimization gives the sparsest representation, but it is an NP-hard problem [2]. A good alternative is the ℓ<sub>1</sub> norm, which can also be used to obtain sparsity. Recent results show that the representation obtained by ℓ<sub>1</sub> norm optimization achieves the sparsity condition and can be computed in polynomial time [6,7]. Thus the optimization problem in Eq. (8) becomes:

$$\min_{\mathbf{s}} \left\| \mathbf{s} \right\|_{1} \quad \text{subject to} \quad \mathbf{y} = \mathbf{Ds} \tag{9}$$

Orthogonal matching pursuit (OMP), one of the oldest greedy algorithms, is used here to obtain the sparse representation [4]. It employs orthogonal projections at each iteration and is known to converge in few iterations. For OMP to work as desired, all atoms of dictionary **D** must be normalized such that ‖d<sub>i,j</sub>‖<sub>2</sub> = 1, where i = (1, 2) indexes the classes and j = 1, 2, ..., N. Using OMP we obtain the sparse representation **s** of the feature vector **y**, which is then used to classify the MI signals.
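Scikit-learn, which is also used in Sect. 3.2, provides an OMP implementation. Below is a sketch with a synthetic unit-norm dictionary and a test vector built from three known atoms; the sparsity level and dimensions are illustrative.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(2)
n_features, n_atoms = 6, 80
D = rng.standard_normal((n_features, n_atoms))
D /= np.linalg.norm(D, axis=0)          # unit-norm atoms, as OMP requires

# Hypothetical test vector: a sparse combination of three atoms plus noise.
s_true = np.zeros(n_atoms)
s_true[[3, 17, 60]] = [1.0, -0.5, 0.8]
y = D @ s_true + 0.01 * rng.standard_normal(n_features)

# Solve Eq. (9) greedily: at most 3 non-zero coefficients.
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=3, fit_intercept=False)
omp.fit(D, y)
s = omp.coef_                            # sparse representation of y
print(np.flatnonzero(s))                 # indices of the selected atoms
```

The resulting vector **s** is then split into per-class segments for the classification rules described next.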

**Sparsity Based Classification:** After minimization, the input vector **y** is approximated by a sparse vector **s** whose length equals the number of atoms in the dictionary **D**. Each value of the sparse vector is the weight given to the corresponding atom of the dictionary. The dictionary contains an equal number of atoms for each class: if, for example, there are 1400 atoms in a two-class MI dictionary, the first 700 values of the sparse vector describe the linear relationship between the input vector and the first class, i.e. the right-hand MI class, and so on. Hence, the sparse representation can be used for classification by applying simple rules to the sparse vector **s**. In this work, we use two classification rules, termed classifier<sub>1</sub> and classifier<sub>2</sub>, defined as follows:

$$Classifier_1(\mathbf{y}) = \underset{i=1,2}{\operatorname{argmax}} \max\left(Var\left(\mathbf{s}_i\right)\right) \tag{10}$$

$$Classifier_2(\mathbf{y}) = \underset{i=1,2}{\operatorname{argmax}} \max\left(nonzero\left(\mathbf{s}_i\right)\right) \tag{11}$$

where max() returns the maximum value of a vector, Var() computes the variance of the data and nonzero() counts the non-zero elements of a vector. The input is assigned to the class i whose segment of **s** has the maximum variance or the maximum number of non-zero elements.
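The two rules reduce to a few lines of code; the sparse vector and segment length below are illustrative, mirroring the 1400-atom example in the text.

```python
import numpy as np

def classify(s, n_per_class):
    """Apply the two rules of Eqs. (10)-(11) to sparse vector s.

    s is split into one contiguous segment per class; the predicted class is
    the segment with the largest variance (rule 1) or the largest number of
    non-zero coefficients (rule 2).
    """
    segments = [s[i * n_per_class:(i + 1) * n_per_class] for i in range(2)]
    rule1 = int(np.argmax([np.var(seg) for seg in segments]))            # classifier_1
    rule2 = int(np.argmax([np.count_nonzero(seg) for seg in segments]))  # classifier_2
    return rule1, rule2

# Toy sparse vector: all the mass sits in the second class's segment.
s = np.zeros(1400)
s[[900, 1010, 1200]] = [0.9, 0.4, 0.7]
print(classify(s, 700))  # (1, 1): both rules pick class index 1 (right-foot)
```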

# **3 Experimental Results**

The performance of the model in our experiment depends on the prediction performance of the classifier. A k-fold cross validation was performed: the data were split into k folds, of which k − 1 folds were used to build the dictionary and one fold was used for testing. Each fold was used for testing in turn and the accuracies were calculated. Two different dictionaries were built: one with bandpower features, **D**<sub>BP</sub>, and the other with the energies of a wavelet transform, **D**<sub>WE</sub>. Accuracy on the test set is used as the metric of classifier performance.
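The evaluation protocol can be sketched as follows. The feature matrix and the nearest-class-mean rule are illustrative placeholders for the dictionary-building and sparse classification steps; the 280-trial, two-class layout follows the dataset description.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(3)
X = rng.standard_normal((280, 6))   # hypothetical per-trial feature vectors
y = np.repeat([0, 1], 140)          # 140 right-hand + 140 right-foot trials

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
accuracies = []
for train_idx, test_idx in skf.split(X, y):
    # In the SFS pipeline, the training folds would build the dictionary;
    # here a nearest-class-mean rule stands in for the sparse classifier.
    means = np.stack([X[train_idx][y[train_idx] == c].mean(axis=0) for c in (0, 1)])
    dists = np.linalg.norm(X[test_idx][:, None, :] - means[None], axis=2)
    pred = dists.argmin(axis=1)
    accuracies.append(np.mean(pred == y[test_idx]))
print(len(accuracies))  # 10 per-fold accuracies, averaged for the reported score
```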

### **3.1 Results of Sparsity Based Classification**

We had right-hand and right-foot MI signals that needed to be classified. To illustrate how sparsity plays an important role in our classification, Fig. 5 shows

**Fig. 5.** Sparse representation **s** obtained for the two sample test signals. Here, the left figure represents the sparse signal of right-hand class and the right figure for the right-foot class.

the sparse representations of two sample test signals belonging to the two classes, using **D**<sub>BP</sub> as the dictionary matrix. Here there are around 1400 atoms in the dictionary, so the first 700 elements correspond to the first class and the rest to the second class. The sparse representation clearly separates the two classes. Table 1 shows the accuracies of the two classifiers under k-fold cross validation using the dictionaries **D**<sub>BP</sub> and **D**<sub>WE</sub>. The results show that classifier<sub>1</sub> performs better than classifier<sub>2</sub>, and that sparsity based classification using the dictionary **D**<sub>WE</sub> outperforms the bandpower dictionary **D**<sub>BP</sub>. The normalized and non-normalized confusion matrices of both classifiers using dictionary **D**<sub>WE</sub> are given in Fig. 6.

**Table 1.** *k*-fold cross validation accuracies for the classifiers using **DBP** and **DWE** dictionary.


#### **3.2 Comparison with SVM and LR**

To evaluate the proposed SFS method, we compared it, using **D**<sub>WE</sub> as the dictionary, with the conventional SVM [17,18] and LR [13] methods. As classifier<sub>1</sub> gives better accuracy than classifier<sub>2</sub>, it is used for the comparison. For real-time BCI applications, classifier speed is an important issue, so CPU execution time was estimated for all methods. All classifier algorithms were run on the same computer

**Fig. 6.** Confusion matrix of *classifier*<sub>1</sub> and *classifier*<sub>2</sub> using the dictionary **DWE**.

with the same software, Python 2.7, making use of the Scikit-learn<sup>2</sup> machine learning package. The accuracies and CPU execution times obtained for the different folds for the proposed SFS method, using classifier<sub>1</sub> and **D**<sub>WE</sub> as the dictionary, and for the

**Table 2.** Comparison of *k*-fold cross-validation accuracy and CPU execution time of various folds for the proposed SFS approach, and the conventional SVM and LR classifier methods.


<sup>2</sup> http://scikit-learn.org.

conventional SVM and LR are listed in Table 2. The average values indicate that the proposed SFS method delivers higher average classification accuracy and lower execution time than the SVM and LR methods. Since the proposed method executes in less time with higher accuracy, it can be used to build real-time MI-based BCI applications for motor disabled people.

### **4 Conclusion**

In this work, we used a new spatial filtering based sparsity (SFS) approach to classify two-class MI-based EEG signals for BCI applications. First, the 118-channel EEG signal is high-dimensional, so a constrained subset of channels is selected to reduce the computational complexity. Second, to better discriminate the MI classes, band-pass filtering with two sub-bands of 7–13 Hz and 13–30 Hz is applied to the selected channels, followed by CSP filtering. Third, EEG signals vary between users and across sessions; the dictionary matrix required by the SFS method is designed from the bandpower and wavelet features of the spatially filtered signals, which helps to overcome this inter-subject variability problem. The method also reduces the computational complexity significantly and increases the speed and accuracy of the BCI system. Hence, the proposed SFS approach can serve to design more robust and reliable MI-based real-time BCI applications, such as text-entry systems, gaming and wheelchair control, for motor impaired people. Future work will focus on extending the sparsity approach to classify multi-class MI tasks, which can be further used for communication purposes.

### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

### **Mental Workload Assessment for UAV Traffic Control Using Consumer-Grade BCI Equipment**

Federica Bazzano<sup>(B)</sup>, Paolo Montuschi, Fabrizio Lamberti, Gianluca Paravati, Silvia Casola, Gabriel Cerón, Jaime Londoño, and Flavio Tanese

Dip. di Automatica e Informatica, Politecnico di Torino, Corso Duca degli Abruzzi, 24, 10129 Turin, Italy
{federica.bazzano,paolo.montuschi,fabrizio.lamberti,gianluca.paravati, silvia.casola,gabriel.ceron,jaime.londono,flavio.tanese}@polito.it

**Abstract.** The increasing popularity of unmanned aerial vehicles (UAVs) in critical applications makes supervisory systems based on the presence of a human in the control loop of crucial importance. In UAV-traffic monitoring scenarios, where human operators are responsible for managing drones, systems flexibly supporting different levels of autonomy are needed to assist them when critical conditions occur. The assessment of UAV controllers' performance, and thus of their mental workload, may be used to discriminate the level and type of automation required. The aim of this paper is to build a mental-workload prediction model based on UAV operators' cognitive demand to support the design of an adjustable autonomy supervisory system. A classification and validation procedure was performed to both categorize the cognitive workload measured from ElectroEncephaloGram signals and evaluate the obtained patterns in terms of accuracy. Then, a user study was carried out to identify critical workload conditions by evaluating operators' performance in accomplishing the assigned tasks. Results obtained in this study provide precious indications for guiding further developments in the field.

**Keywords:** Adjustable autonomy · Mental workload · Supervisory control · Learning model

### **1 Introduction**

In recent years, the unmanned aerial vehicle (UAV) application domain has seen rapidly growing interest in the development of systems able to assist human beings in critical operations [1–3]. Examples of such applications include security and surveillance, monitoring, search and rescue, disaster management, etc. [4].

Systems able to flexibly support different levels of autonomy (LOAs) according to both humans' cognitive resources and their performance in accomplishing

© The Author(s) 2017

Work reported has been partially funded by TIM JOL Connected Robotics Applications LaB (CRAB).

P. Horain et al. (Eds.): IHCI 2017, LNCS 10688, pp. 60–72, 2017. https://doi.org/10.1007/978-3-319-72038-8\_6

critical tasks, may be exploited to determine situations in which system intervention may be required [5–7]. The human's cognitive resources and the ability of the system to dynamically change the LOA according to the considered context are generally termed "*cognitive or mental workload*" [8] and "*adjustable or sliding autonomy*" [9], respectively.

In the literature, several criteria have been investigated to evaluate humans' cognitive load. The main measurement techniques have historically been classified into three categories: physiological, subjective, and performance-based [10]. Physiological measurements are cognitive load assessment techniques based on the physical response of the body. Subjective measurements evaluate humans' perceived mental workload by exploiting rankings or scales. Performance, or objective, measurements evaluate humans' ability to perform a given task.

Moving from the above considerations, the aim of this paper is to build a classification and prediction model of UAV operators' mental workload to support the design of an adaptive autonomy system able to adjust its level of autonomy accordingly. ElectroEncephaloGram (EEG) signals were used as the physiological technique for assessing operators' mental workload, and a Support Vector Machine (SVM) was leveraged as the learning and classification model [11–13].
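As an illustration of such a pipeline (not the authors' actual feature set or data), a scikit-learn SVM trained on hypothetical per-window EEG band-power features labelled with a low/high workload condition might look like:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(4)
# Hypothetical band-power features from 14 EEG channels, one row per
# analysis window, with a binary low/high workload label per window.
X = rng.standard_normal((200, 14))
y = rng.integers(0, 2, 200)

# Standardize features, then fit an RBF-kernel SVM on a training split.
clf = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0))
clf.fit(X[:150], y[:150])
acc = clf.score(X[150:], y[150:])   # held-out classification accuracy
print(acc)
```

A validation procedure such as cross-validation on the held-out windows would then assess the accuracy of the obtained workload patterns, as described in the abstract.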

A 3D simulation framework was exploited in this work both to experiment with different flying scenarios of a swarm of autonomous drones in an urban environment and to test the operator's performance in UAV-traffic management. A user interface was also used to show a 2D visualization of the simulated environment and to allow human operators to interact with UAVs by issuing flight commands.

A user study was carried out with several volunteers both to evaluate operators' performance in accomplishing supervision tasks with a growing number of drones and to gather different workload measurements under critical conditions.

The rest of the paper is organized as follows. In Sect. 2, relevant works concerning workload measurements are reviewed. In Sect. 3, the device exploited in the study is described. Sections 4 and 5 provide an overview of the overall simulation framework and report details of the user interface considered in this work, respectively. Sections 6 and 7 introduce the methodology that has been adopted to perform the experimental tests and discuss data analysis and the classification procedure. Lastly, Sect. 8 discusses obtained results and concludes the paper by providing possible directions for future research activities in this field.

### **2 Related Work**

Many studies have investigated the relationship between the tasks performed by an individual and that individual's cognitive load. In the literature, different techniques have been proposed for mental workload assessment [10].

For instance, concerning subjective measurement techniques, [14,15] exploited the NASA-TLX questionnaire to evaluate users' perceived workload in gaze-writing and robotic manipulation tasks, respectively. Similarly, Squire et al. [16] investigated the impact of self-assessed mental workload in simulated game activities.

Although these measurements have proved to be a reliable way to assess humans' mental workload [17], they often require annoying or repetitive interactions, asking users to fill in various rankings or scales.

In parallel to these studies, other works have evaluated physiological measurements as mental workload assessment techniques. For example, Wilson et al. [18] exploited EEG channels, electrocardiographic (ECG), electrooculographic (EOG), and respiration inputs for cognitive workload evaluation in air traffic control tasks. Functional Near-Infrared Spectroscopy (fNIRS) and Heart Rate Variability (HRV) techniques were exploited in [19] and [20] to assess mental workload in n-back working memory tasks and ship simulators, respectively. Besserve et al. [21] studied the relation between EEG data and reaction time (RT) to characterize the level of performance during a cognitive task, in order to anticipate human mistakes.

Although these studies have provided evidence on how to improve accuracy in workload measurements, they traditionally exploit bulky and expensive equipment that is uncomfortable to use in real application scenarios [22]. Data about the suitability of alternative devices for physiological measurements are therefore required to properly support further advancements in the field. Some activities in this direction have already been carried out. For instance, Wang et al. [12] proved that a small device, such as the 14-channel EMOTIV® headset, can be successfully used to characterize mental workload in a simple n-back memory task.

The goal of the present paper is to build on the results reported in [12] by studying a different application scenario, exploiting EEG signals to build a prediction model of UAV operators' mental workload in drone monitoring tasks.

### **3 Emotiv Epoc Headset**

This section briefly describes the wearable brain-sensing device EMOTIV Epoc+®<sup>1</sup> considered in this study, illustrating its hardware and software features. The EMOTIV Epoc+ (Fig. 1a) is a wireless Brain Computer Interface (BCI) device manufactured by Emotiv. The headset provides 14 wireless EEG signal acquisition channels at 128 samples/s (Fig. 1b). The recorded EEG signal is transmitted to a USB dongle that delivers the collected information to the host workstation. A subscription software named Pure·EEG is provided by Emotiv to gather both the raw EEG data and the dense spatial resolution array containing data at each sampling interval.

### **4 Simulation Framework**

The basic idea inspiring the design of the present framework is to test different UAV flying scenarios in an urban environment. Such scenarios simulate

<sup>1</sup> https://www.emotiv.com/epoc/.

**Fig. 1.** Emotiv EPOC headset (a) and its 14 recorder positions (b).

potentially critical situations in which drones could be involved. The logical components assembled to implement the proposed framework are illustrated in Fig. 2. In more detail, the *UAVs Simulator* is the module responsible for simulating swarms of autonomous drones flying in the 3D virtual environment. It consists of three different modules, namely: *Autopilot*, *Physics Simulation* and *Ground Control Station (GCS)*.

**Fig. 2.** Logical components of the simulation framework.

The *Autopilot* module is responsible for running the drones' flight stability software without any specific hardware. More specifically, it exploits the Software-In-The-Loop (SITL)<sup>2</sup> simulator to run the PX4 Autopilot Flightcode<sup>3</sup>, an open source UAV firmware supporting a wide range of vehicle types. The *Physics Simulation* module is the block devoted to loading the 3D urban environment and executing the drone flight simulation in it. The Gazebo<sup>4</sup> physics engine was exploited in this block

<sup>2</sup> http://ardupilot.org/dev/docs/sitl-simulator-software-in-the-loop.html.

<sup>3</sup> https://px4.io.

<sup>4</sup> https://gazebosim.org.

for modeling and rendering the 3D models of the drones with their physical properties, constraints and sensors (e.g. laser, camera). In particular, Gazebo runs on the Robot Operating System (ROS)<sup>5</sup>, a software framework developed for performing robotics tasks. The *Ground Control Station (GCS)* module contains the software used for setting the drones' starting locations, planning missions and getting real-time flight information. The communication between the Autopilot Flightcode and the GCS module is provided by the Micro Air Vehicle ROS (MAVROS) node with the MAVLink communication protocol (Fig. 2).

Since drones communicate and transmit information through the network, low bandwidth coverage areas could lead to loss of communication and thus to potentially critical conditions. Hence, a *Bandwidth Simulator* was developed to estimate, for the simulated city, the maximum amount of data the network can transmit per unit of time. The network transmission rate is assumed to depend on the population density of the city sites (parks, stadiums, schools, etc.) and on the network coverage.

Lastly, the *Alert Module* is the block devoted to determining the level of risk (later referred to as "*Alert*") of each drone by gathering data from both the UAVs and Bandwidth Simulators. Specifically, as in [23,24], the UAVs Simulator provides drone information regarding both battery level and distance from obstacles (e.g. buildings). The Bandwidth Simulator sends the estimated network transmission rate in the areas around the drones' positions. The mapping between these parameters and each drone's "*Alert*" is performed through a function defined as y = (b − 1)<sup>−1</sup> · (o − 1)<sup>−1</sup> · (n − 1)<sup>−1</sup>, where b represents the drone's battery level, o is its distance from obstacles, n is the estimated bandwidth coverage around its position and y is its level of risk. Three different "*Alert*" levels are proposed in this work, namely: *"Safe"*, *"Warning"* and *"Danger"*.
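As a rough illustration, the alert mapping above can be sketched in Python; the rescaling of b, o and n to values above 1 and the two threshold values are assumptions made for the sketch, not values from the paper.

```python
def alert_level(b, o, n):
    """Sketch of the risk function y = (b-1)^-1 * (o-1)^-1 * (n-1)^-1.
    Assumes b (battery), o (obstacle distance) and n (bandwidth) are
    rescaled to values greater than 1, so that values approaching 1
    (low battery, close obstacle, poor coverage) drive the risk y up."""
    return 1.0 / ((b - 1.0) * (o - 1.0) * (n - 1.0))


def alert_label(y, warning=0.1, danger=1.0):
    """Map risk y to the three alert levels; both thresholds are
    illustrative assumptions, not values from the paper."""
    if y >= danger:
        return "Danger"
    if y >= warning:
        return "Warning"
    return "Safe"
```

With this scaling, a drone whose three parameters are all close to their critical value 1 gets a large y and a *"Danger"* label, while comfortable margins on all three yield *"Safe"*.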

### **5 User Interface**

This section presents the user interface devised to show the 2D visualization of the simulated environment, along with the information human operators need to interact with the UAVs.

As illustrated in Fig. 3a, a wide region of the operator's display is covered by the 2D map of the city, in which the real-time locations of the drones are shown. A colored marker depicts each drone's GPS position as well as its current status. Three different colors illustrate the drone's level of risk: green (*"Safe"*), yellow (*"Warning"*) and red (*"Danger"*). On the right side of the interface, an extensive visual summary for each drone is shown, reporting its unique name, its battery level, the bandwidth coverage of the area around its location and its flying altitude (Fig. 3b). Right below the map are five buttons allowing operators to issue flight commands or show general information about the map or the drones (Fig. 3c). More specifically, the "*Start*" button is used to run the 3D simulation, whereas the "*Options*" button is used to show or

<sup>5</sup> https://www.ros.org.

**Fig. 3.** Monitoring interface (a), UAVs summary (b) and control buttons (c). (Color figure online)

hide the bandwidth coverage of the city and the drones' paths. The other three buttons are used by the human operator to land, hover or change a drone's path, respectively. In this scenario, it is worth observing that EEG signals could be affected by the movements human operators make when pressing the above buttons. Thus, an artifact removal stage is needed to remove all undesired signals, as detailed in Sect. 7.1.

### **6 User Tasks**

The goal of this paper is to exploit EEG signals to build a prediction model of UAV operators' mental workload, in order to train a system able to autonomously predict operators' performance in UAV monitoring operations. To this aim, an SVM classification algorithm was exploited to learn the ability of operators to carry out assigned drone-traffic-control tasks in different flying scenarios. Four monitoring tasks were experimented with in this work, namely: *M1*, *M2*, *M3* and *M4*. In particular, *M1* consisted of a single flying drone whose path was designed to avoid obstacles on its route. No operator action was necessary to successfully complete the mission. *M2* was meant to evaluate the operator's performance in monitoring two drones at risk of colliding. Collisions were deliberately spaced apart in time so that the operator was virtually able to deal with them, keeping the effort to complete the mission relatively low. Mission *M3* consisted of five drones, three of which were at high risk of colliding. This mission was intentionally created to be very difficult to complete, even though theoretically still manageable. Lastly, *M4* consisted of six drones, each of which required the operator's intervention to successfully complete the mission. It was devised to be extremely hard to complete.

A mission is considered *"successfully completed"* when all drones land in the intended positions, or *"failed"* when at least one drone crashes. The number of drones in each mission was defined relying on a preliminary experiment which showed no significant difference in operators' mental workload when monitoring three or four UAVs. Data collected during mission M1 were used as a mental workload baseline, whereas those recorded in M4 served as a high mental workload reference.

### **7 Data Analysis and Classification**

This section details the data analysis and classification procedure performed in this work. It entails the following steps: *data pre-processing*, *feature extraction* and *classification*.

### **7.1 Pre-processing**

EEG consists of recording the electric signals produced by the activation of thousands of neurons in the brain. These signals are gathered by electrodes located over the scalp. However, some spurious signals may affect the EEG data due to the presence of noise or artifacts. In particular, artifacts, which are signals with no cerebral origin, can be divided into two groups. The first group is related to physiological sources such as eye blinking, ocular movement and heart beating. The second group consists of mechanical artifacts, such as the movement of electrodes or cables during data collection [25]. Thus, a preprocessing stage is needed to remove all undesired signals and noise. It consists of three different phases, namely: *filtering*, *offset removal* and *artifact removal*. The EEGLAB toolbox under the Matlab environment [26] was exploited in this phase.

Since EEG signal frequencies lie between 0.5 and 45 Hz, the *filtering* phase implements a Finite Impulse Response (FIR) band-pass filter to remove signals outside this band and increase the signal-to-noise ratio. The *offset removal* phase eliminates potential offset residues after the filtering phase. The last stage exploits the Artifact Subspace Reconstruction (ASR) algorithm for artifact removal [27].
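The filtering phase can be sketched as follows with SciPy rather than the EEGLAB/Matlab toolchain the authors used; the filter length is an illustrative choice, not a value from the paper.

```python
import numpy as np
from scipy.signal import firwin, filtfilt

FS = 128                 # Emotiv Epoc+ sampling rate (samples/s)
LOW, HIGH = 0.5, 45.0    # EEG band of interest (Hz)


def bandpass(eeg, fs=FS, numtaps=129):
    """Zero-phase FIR band-pass filter (applied forward and backward,
    so no phase distortion). numtaps=129 is an illustrative choice."""
    taps = firwin(numtaps, [LOW, HIGH], pass_zero=False, fs=fs)
    return filtfilt(taps, [1.0], eeg, axis=-1)


# Example: a 10 Hz alpha wave contaminated by a 60 Hz artifact
t = np.arange(0, 4, 1 / FS)
clean = np.sin(2 * np.pi * 10 * t)
noisy = clean + 0.5 * np.sin(2 * np.pi * 60 * t)
filtered = bandpass(noisy)
```

The 10 Hz component lies inside the 0.5–45 Hz passband and survives, while the 60 Hz artifact is attenuated, improving the signal-to-noise ratio.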

#### **7.2 Feature Extraction**

Given the preprocessed data, relevant features have to be extracted to train the classification model. For this purpose, temporal ranges of the signals containing the relevant events to be analyzed are defined. In this work, the signal was split into 5 s windows covering the interval from 15 s after the start of the EEG recording to 15 s before the first failure. Data recorded during the idle takeoff phase were ignored, to avoid using the related mental workload measurements as a baseline reference in the UAV monitoring experiment. Data in the ranges just before and after the first failure were excluded, since they may be biased by the operator's frustration at failing the assigned task. For each window, the following features were calculated channel by channel: Power Spectral Density, Mean, Variance, Skewness, Kurtosis, Curve length, Average non-linear energy and Number of peaks [12]. These features were then concatenated so that each window corresponds to a row of features appearing in channel order. Each row was then assigned a label stating whether or not the operator failed the task for that particular mission.
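The per-window feature extraction can be sketched as follows; the exact parameterizations (Welch segmenting, peak criteria) are assumptions, since the paper only names the features.

```python
import numpy as np
from scipy.stats import skew, kurtosis
from scipy.signal import welch, find_peaks

FS = 128  # Emotiv Epoc+ sampling rate


def window_features(win, fs=FS):
    """The eight features named in the text, for one channel of one
    5 s window."""
    psd = welch(win, fs=fs)[1].mean()                    # avg power spectral density
    curve_len = np.abs(np.diff(win)).sum()               # curve length
    nle = np.mean(win[1:-1] ** 2 - win[:-2] * win[2:])   # avg non-linear energy
    n_peaks = len(find_peaks(win)[0])
    return np.array([psd, win.mean(), win.var(), skew(win),
                     kurtosis(win), curve_len, nle, n_peaks])


def feature_rows(eeg, fs=FS, win_s=5):
    """eeg: (channels, samples). Returns one row per window, with the
    per-channel features concatenated in channel order."""
    step = win_s * fs
    n_win = eeg.shape[1] // step
    return np.array([
        np.concatenate([window_features(eeg[c, w * step:(w + 1) * step], fs)
                        for c in range(eeg.shape[0])])
        for w in range(n_win)])


rng = np.random.default_rng(0)
eeg = rng.normal(size=(14, 15 * FS))   # 14 channels, 15 s of toy data
X = feature_rows(eeg)                  # 3 windows x (14 channels * 8 features)
```

Each resulting row would then receive the mission's success/failure label, as described above.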

#### **7.3 Classification**

The aim of this step is to train the classification system considered in this study on the operators' mental workload, so as to predict their performance in UAV monitoring operations. Three different models were exploited in this work: two classifiers predicting the outcome of each mission for each single subject, and a third one trained on the data gathered from all operators, in order to understand whether a generalized model may also be employed.

A procedure dealing with *feature scaling*, *hyperparameter optimization*, *results validation* and *learning model design* was proposed in order to judge the accuracy of the considered models.

**Feature Scaling.** An important issue in the signal processing field, and in particular with EEG data, is the high variability of the features extracted from each subject, and thus their different ranges. An appropriate scaling method is needed to normalize all data into the same range. A *z-score* scaler was used as the normalization method: mean values are subtracted from all measured signals and the difference is divided by the population standard deviation [28].
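A minimal z-score sketch; fitting the mean and population standard deviation on the training data only avoids leaking test-set statistics into the scaling.

```python
import numpy as np


def zscore_fit(x_train):
    """Estimate mean and population standard deviation (ddof=0) on the
    training set only."""
    return x_train.mean(axis=0), x_train.std(axis=0)


def zscore_apply(x, mean, std):
    # Subtract the mean, then divide by the standard deviation
    return (x - mean) / std


rng = np.random.default_rng(42)
x_train = rng.normal(loc=5.0, scale=3.0, size=(100, 8))  # toy feature matrix
mean, std = zscore_fit(x_train)
z = zscore_apply(x_train, mean, std)
```

After scaling, every feature column of the training data has zero mean and unit standard deviation, so features with different raw ranges contribute comparably to the SVM.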

**Hyperparameter Optimization and Validation Methodology.** Since the aim of the classification methodology is to achieve good accuracy on unseen data, an appropriate validation method is necessary to measure the generalization error of the implemented model. For this purpose, a k-fold cross validation technique was used both to find the best model with the optimal parameters and to test its performance on new unseen data. It consists of subdividing the samples into k folds, k − 1 of which are used in each iteration to train the model, while the remaining one is used to evaluate the results.

According to this validation methodology, data were divided into three different groups, namely *training set*, *validation set* and *test set*, as follows: 20% as *test set*, and the other 80% as *training* and *validation sets*. A ten-fold cross validation is then performed on the *training* and *validation sets*: samples are divided into ten folds, nine of which are used in each iteration to train the model, while the other one is used to evaluate the results. This procedure is iterated until every fold has been used once as *validation set*. The training accuracy is then evaluated as the mean of the results obtained in the different iterations. The parameters leading to the best model performance, called "*hyperparameters*", are then selected [29]. Lastly, the model is evaluated using the *test set*.
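The split and ten-fold procedure can be sketched with scikit-learn; the feature matrix and labels below are synthetic placeholders for the EEG feature rows and mission outcomes.

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC

# Toy stand-in for the EEG feature matrix and mission outcome labels
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 16))
y = (X[:, 0] + 0.1 * rng.normal(size=120) > 0).astype(int)

# 20% held out as test set; the remaining 80% feeds the ten-fold CV
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = SVC(kernel="linear", C=1.0)
cv_acc = cross_val_score(clf, X_tr, y_tr, cv=10).mean()  # mean over 10 folds

clf.fit(X_tr, y_tr)                # refit on all training data
test_acc = clf.score(X_te, y_te)   # final evaluation on unseen data
```

`cv_acc` corresponds to the "Accuracy (Validation set)" column of Table 1 and `test_acc` to "Accuracy (Test set)".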

**Learning Model.** A Support Vector Machine (SVM), a learning model able to infer a function from labeled training data, is exploited in this phase to deduce from the operator's EEG workload his or her ability to succeed in a mission. It is implemented with two different kernels: linear and Radial Basis Function (RBF). The former is used to find the best separating hyperplane in binary classification problems by tuning the regularization parameter C. The latter is generally used in problems that are not linearly separable and additionally requires finding the best value of the γ parameter [13].

The C parameter regularizes the model and controls the bias-variance tradeoff. The γ parameter defines the width of the Radial Basis Function (RBF). A grid search over powers of ten from 10<sup>−2</sup> to 10<sup>2</sup> was used to tune the C parameter through the cross-validation phase. For the γ parameter, powers of ten from 10<sup>−4</sup> to 10 were used, considering that larger values fit the model more closely to the training set but may cause variance problems, i.e. over-fitting, while smaller values may cause bias, i.e. under-fitting.
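The grid search over both kernels can be sketched with scikit-learn; the search ranges follow the text, while the data below is a synthetic placeholder.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
y = (X[:, :2].sum(axis=1) > 0).astype(int)   # toy labels

# Search ranges from the text: C in 10^-2..10^2, gamma in 10^-4..10^1
param_grid = [
    {"kernel": ["linear"], "C": np.logspace(-2, 2, 5)},
    {"kernel": ["rbf"], "C": np.logspace(-2, 2, 5),
     "gamma": np.logspace(-4, 1, 6)},
]
# Ten-fold cross validation selects the hyperparameters, as in Sect. 7.3
search = GridSearchCV(SVC(), param_grid, cv=10)
search.fit(X, y)
```

`search.best_params_` then holds the selected kernel and its C (and, for RBF, γ) values, and `search.best_score_` the mean validation-fold accuracy.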

### **8 Results and Discussion**

As anticipated, the goal of this paper is to build a UAV operators' mental workload prediction model in order to train a system able to autonomously predict operators' performance in UAVs monitoring operations. To this aim, mental workload data have been collected through a user study.

The study involved 10 participants (8 males and 2 females, aged between 19 and 24), selected from the students of Politecnico di Torino. After a brief training, participants were invited to perform the four tasks M1, M2, M3 and M4 in sequence through the user interface. These tasks were specifically designed to test operators' performance in UAV monitoring operations with an increasing level of drone risk. Each task, whose length depended on the operator's piloting choices, took from 2 to 7 min. During each experiment (i.e., all tasks performed), the EEG signal gathered through the EMOTIV Epoc+ headset was recorded. The EEG signal was split into time windows as detailed in Sect. 7.2. For each window, the following features were calculated: Power Spectral Density, Mean, Variance, Skewness, Kurtosis, Curve length, Average non-linear energy and Number of peaks. These features were then concatenated so that each window corresponds to a row of features appearing in channel order. Each row was then assigned a label stating whether or not the operator failed the task for that particular mission. This procedure was performed to generate a heterogeneous population from which to build a classifier able to autonomously predict the label from operators' mental workload as measured by EEG signals.

The results obtained in terms of classification accuracy are reported in Table 1, which also specifies the hyperparameters used to train each model. The first ten rows of the table represent the results obtained with the individual models trained using single-subject data. The last row shows the overall results using all the collected data. Looking more closely at Table 1, the fifth and


**Table 1.** Results concerning the accuracy of the classification algorithm for the individual and overall models.

seventh rows present corrupted data that were discarded for validation purposes. In those cases, participants completed only one mission successfully, making it very difficult to train the model due to class skewness. As a result, no individual model was trained using those data. However, they were used in the overall model.

The accuracy scores obtained in the ten-fold cross-validation phase (Sect. 7.3) are reported in Table 1 as "Accuracy (Validation set)". The accuracy obtained on new unseen data is reported as "Accuracy (Test set)". It is worth observing that the accuracy scores in these two columns for the same row are not largely different. This observation allows us to conclude that the proposed model does not suffer from variance problems and thus performs well when tested with other participants under the same conditions.

Results regarding the accuracy on the test sets show that the linear kernel always performs at least as well as the RBF kernel for individual models. On the contrary, the RBF kernel performs better than the linear kernel for the overall model. Specifically, the SVM with the linear kernel is able to predict the operator's performance outcomes, and thus the level of his or her mental workload, with an average accuracy of 95.8% and 83.9% when the model is trained on a single user and on all collected data, respectively. An accuracy of 94.1% and 85.6% is reached with the SVM-RBF kernel when the model is trained using single-user and overall data, respectively. This is reasonably due to the fact that individual models trained on single-subject data face simpler classification problems than those trained on all the collected data.

In this work, the data analysis and classification procedure was performed offline on the data collected through the user study. Future work will address alternative procedures allowing online evaluation of the data.

### **References**



The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

### **Improving Classification Performance by Combining Feature Vectors with a Boosting Approach for Brain Computer Interface (BCI)**

Rachel Rajan(✉) and Sunny Thekkan Devassy

GEC, Thrissur, Kerala, India rachelrajan13@gmail.com, sunnythekkan@rediffmail.com

**Abstract.** In the classification of multichannel electroencephalograph (EEG) based BCI studies, the spatial and spectral information related to the brain activities associated with BCI paradigms is usually pre-determined as a default without scrutiny, which can lead to a loss of effectiveness in practical applications due to individual variability across different subjects. Recent studies have shown that combining features, each specifically tailored to a different physiological phenomenon such as the Readiness Potential (RP) and Event Related Desynchronization (ERD), might benefit BCI by making it robust against artifacts. Hence, the objective is to design a CSSBP with combined feature vectors, where the signal is divided into several sub-bands using a band-pass filter, channel and frequency configurations are then modeled as preconditions before learning base learners, and a new heuristic of stochastic gradient boosting is introduced for training the base learners under these preconditions. Results showed that the boosting approach using feature combination clearly outperformed the state-of-the-art algorithms and improved the classification performance, resulting in increased robustness.

**Keywords:** Brain computer interface · Motor imagery · Feature combination · Spatial-spectral precondition · Stochastic gradient boosting · Rehabilitation training

### **1 Introduction**

Brain-computer interfaces (BCIs) provide a communication channel for a user to control an external device using only the brain's neural activity. They can be used as a rehabilitation tool for patients with severe neuromuscular disabilities [7], as well as in a range of other applications including neural prostheses, Virtual Reality (VR) and internet access. Among the different neuroimaging techniques, electroencephalography (EEG) is one of the non-invasive methods most exploited in BCI experiments. Within EEG, event related desynchronization (ERD), visually evoked potentials (VEP), slow cortical potentials (SCP), and P300 evoked potentials are widely used for BCI studies.

Rachel Rajan M. Tech student; S. Thekkan Devassy Asst. Professor.

<sup>©</sup> The Author(s) 2017 P. Horain et al. (Eds.): IHCI 2017, LNCS 10688, pp. 73–85, 2017. https://doi.org/10.1007/978-3-319-72038-8\_7

In accordance with the topographic patterns of brain rhythm modulations, feature extraction using the Common Spatial Patterns (CSP) algorithm [17] provides subject-specific and discriminant spatial filters. However, CSP has some limitations: it is sensitive to the frequency bands related to neural activity, so the frequency band is manually selected or set to a broad-band filter. Moreover, it suffers from overfitting when dealing with a large number of channels, so the problem of overfitting the classifier and spatial filter arises from a trivial channel configuration. Hence, a simultaneous optimization of the spatial and spectral filters is highly desirable in BCI studies.
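For reference, the core of the CSP algorithm reduces to a generalized eigenvalue problem on the class covariance matrices; the following is a minimal sketch on synthetic trials, not the paper's implementation.

```python
import numpy as np
from scipy.linalg import eigh


def csp_filters(trials_a, trials_b, n_pairs=3):
    """Minimal CSP sketch. trials_* are lists of (channels, samples)
    arrays for the two classes. Returns 2*n_pairs spatial filters that
    maximize variance for one class while minimizing it for the other."""
    cov_a = np.mean([np.cov(t) for t in trials_a], axis=0)
    cov_b = np.mean([np.cov(t) for t in trials_b], axis=0)
    # Generalized eigenvalue problem: cov_a w = lambda (cov_a + cov_b) w
    vals, vecs = eigh(cov_a, cov_a + cov_b)
    order = np.argsort(vals)
    # Keep the filters at both extremes of the eigenvalue spectrum
    pick = np.concatenate([order[:n_pairs], order[-n_pairs:]])
    return vecs[:, pick].T          # shape: (2*n_pairs, channels)


# Synthetic demo: class a has excess variance on channel 0, class b on channel 1
rng = np.random.default_rng(0)
def make_trials(strong_ch, n_trials=10, n_ch=8, n_samp=200):
    out = []
    for _ in range(n_trials):
        t = rng.normal(size=(n_ch, n_samp))
        t[strong_ch] *= 5.0
        out.append(t)
    return out

trials_a, trials_b = make_trials(0), make_trials(1)
W = csp_filters(trials_a, trials_b, n_pairs=1)
```

Projecting a trial with `W @ trial` yields components whose variances discriminate the two classes, which is exactly the property the log-variance CSP features exploit.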

In recent years, motor imagery (MI) based BCI has proven to be an independent system with high classification accuracy. Most MI based BCIs use brain oscillations in the mu (8–12 Hz) and beta (13–26 Hz) rhythms, which display particular areas of event related desynchronization (ERD) [16], each corresponding to a respective MI state (such as right hand or right foot motion). In addition, the Readiness Potential (RP) [18], a slow negative event-related potential that appears before a movement is initiated, can also be used as input to a BCI to predict future movements. The RP is divided into early RP and late RP. The early RP is a slow negative potential that begins 1.5 s before the action, immediately followed by the late RP, which occurs 500 ms before the movement. In MI based BCI, combining feature vectors [5], i.e., ERD and RP, has shown a significant boost in classification performance.

In the literature, a number of sophisticated CSP based algorithms have appeared, especially in BCI studies; a brief review is presented here. To avoid overfitting and to select optimal frequency bands for the CSP algorithm, various methods were proposed. To mitigate overfitting, Regularized CSP (RCSP) [13] was proposed, in which regularization information is added to the CSP learning procedure. The Common Spatio-Spectral Pattern (CSSP) [11] is an extension of the CSP algorithm with a time-delayed sample. Due to the flexibility limits of its single time delay parameter, the Common Sparse Spectral-Spatial Pattern (CSSSP) [6], which learns a full FIR filter, was presented. Since these methods were computationally expensive, the Spectrally-weighted Common Spatial Pattern (SPEC-CSP) [19] was designed, which alternately optimizes the temporal filter in the frequency domain and then the spatial filter in an iterative process. To improve on SPEC-CSP, Iterative Spatio-Spectral Pattern Learning (ISSPL) [22] was proposed, which does not rely on statistical assumptions and optimizes all temporal filters under a common optimization framework.

Despite various studies and advanced algorithms, it is still a challenge to extract optimal spatial-spectral filters for BCI studies so that they can be used as a rehabilitation tool, especially for disabled subjects. The spatial and spectral information related to the brain activities associated with BCI paradigms is usually pre-determined as a default in EEG analysis without scrutiny, which can lead to a loss of effectiveness in practical applications due to individual variability across different subjects. To address this issue, a CSSBP [12] with combined feature vectors is designed for BCI paradigms, since combining features, each corresponding to a different physiological phenomenon such as the Readiness Potential (RP) and Event Related Desynchronization (ERD), can benefit BCI by making it more robust against artifacts from non-Central Nervous System (CNS) activity such as eye blinks (EOG) and muscle movements (EMG) [5]. First, the EEG signal is divided into several sub-bands using a band-pass filter; then the channel and frequency bands are modeled as preconditions before classifying, and a heuristic of stochastic gradient boosting is used to train the base learners under these preconditions. The effectiveness and robustness of the designed algorithm with feature combination is evaluated on the widely used benchmark dataset BCI competition IV (IIa). The remainder of the paper is organized as follows: a detailed design of the proposed boosting algorithm is given in Sect. 2, performance comparison results are shown in Sect. 3, and conclusions are given in Sect. 4.

### **2 Proposed Algorithm**

This section details the combination model of CSSBP (common spatial spectral boosting pattern) with feature combination; it covers both the problem modeling and the learning algorithm for the model. The model consists of five stages: data preprocessing, which includes multiple spectral filtering by decomposing the signal into several sub-bands using a band-pass filter as well as spatial filtering; feature extraction using common spatial patterns (CSP); feature combination; training the weak classifiers; and pattern recognition with the help of a combination model. The architecture of the designed

**Fig. 1.** Block diagram of proposed boosting pattern

algorithm is shown in Fig. 1. The EEG data is first spatially filtered and band-pass filtered under multiple spatial-spectral preconditions.

Afterwards, the CSP algorithm is applied to extract features from the EEG training dataset and these feature vectors are combined; then the weak classifiers $\{f_m\}_{m=1}^{M}$ are trained and combined into a weighted combination model. Lastly, a new test sample $\hat{x}$ is classified using this combination model.

#### **2.1 Problem Design**

In BCI studies, the two main concerns are the channel configuration and the frequency band, which are predefined as defaults for EEG analysis. Predefining these conditions without deliberation leads to poor performance in real scenarios due to subject variability in EEG patterns. Hence, an efficient and robust configuration is desirable for practical applications.

To model this problem, let us denote the training dataset as $E_{train} = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ is the ith sample and $y_i$ is its corresponding label. The main aim is to find a subset $\omega \subset \nu$ of the set $\nu$ of all possible preconditions, which generates a combination model $F$ by incorporating all sub-models trained under conditions $W_m \in \omega$ while reducing the misclassification rate on the training dataset $E_{train}$, given by,

$$\omega = \arg\min_{\omega \subset \nu} \frac{1}{N} \left|\left\{ i : F(x_i; \omega) \neq y_i \right\}_{i=1}^{N}\right| \tag{1}$$

In the remainder of this section, two homogeneous problems are modeled in detail, and an adaptive boosting algorithm is designed to solve them.

**Spatial Channel and Frequency Band Selection.** For channel selection, the aim is to select an optimal channel set S (S ⊂ U), where U is the universal set of all possible channel subsets of the set of channels C, so that each subset Um in U satisfies |Um| ≤ |C| (here |·| denotes the size of the corresponding set). This produces an optimal combination classifier F on the training data by combining base classifiers learned under different channel set preconditions. Therefore, we get,

$$F(E_{train}; \mathcal{S}) = \sum_{S_m \in \mathcal{S}} a_m f_m(E_{train}; S_m) \tag{2}$$

where F is the optimal combination model, $f_m$ is the mth sub-model learned with channel set precondition $S_m$, $E_{train}$ is the training dataset, and $a_m$ is the combination parameter. The original EEG sample $E_i$ is multiplied by the obtained spatial filter to obtain a projection of $E_i$ onto the channel set $S_m$, which is the so-called channel selection. In the simulation work, 21 channels were selected, denoted as the universal set of all channels, C = (CP6, CP4, CP2, C6, C4, C2, FC6, FC4, FC2, CPZ, CZ, FCZ, CP1, CP3, CP5, C1, C3, C5, FC1, FC3, FC5), where each element indicates an electrode channel.
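Channel selection as described here amounts to a row projection of the trial matrix; a small sketch using the paper's 21-channel set:

```python
import numpy as np

# The paper's universal channel set C (21 electrodes)
CHANNELS = ["CP6", "CP4", "CP2", "C6", "C4", "C2", "FC6", "FC4", "FC2",
            "CPZ", "CZ", "FCZ", "CP1", "CP3", "CP5", "C1", "C3", "C5",
            "FC1", "FC3", "FC5"]


def project_channels(E, subset):
    """Project a trial E (21 channels x samples) onto a channel subset
    S_m; this row selection is what the text calls channel selection."""
    idx = [CHANNELS.index(ch) for ch in subset]
    return E[idx, :]


rng = np.random.default_rng(0)
E = rng.normal(size=(len(CHANNELS), 500))        # toy trial
E_m = project_channels(E, ["C3", "CZ", "C4"])    # one candidate subset S_m
```

Each base classifier $f_m$ would then be trained only on the rows selected by its precondition $S_m$.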

For frequency band selection, the spectrum, denoted G, is simplified as a closed interval whose elements are all integer points (in Hz). G is split into various sub-bands $B_m$ as given in [12, 14], and D denotes the universal set composed of all possible sub-bands. When selecting the optimal frequency band, the objective is to obtain an optimal band set B (B ⊂ D), so that an optimal combination classifier on the training data is produced:

$$F(E_{train}; \mathcal{B}) = \sum_{B_m \in \mathcal{B}} a_m f_m(E_{train}; B_m) \tag{3}$$

where $f_m$ is the mth weak classifier learned on sub-band $B_m$. In the simulation study, a fifth-order zero-phase forward/reverse FIR filter was used to filter the raw EEG signal $E_i$ into the sub-bands $B_m$.
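A sketch of the sub-band decomposition with a zero-phase forward/reverse FIR filter; the 6 taps correspond to the fifth-order filter named in the text, while the 250 Hz sampling rate (that of BCI competition IV IIa) and the example bands are assumptions. Note that such a short FIR filter has a very wide transition band.

```python
import numpy as np
from scipy.signal import firwin, filtfilt

FS = 250  # assumed sampling rate of the BCI competition IV (IIa) recordings


def subband_filter(eeg, band, fs=FS, numtaps=6):
    """Zero-phase forward/reverse FIR filtering into one sub-band B_m.
    filtfilt applies the filter forward and backward, cancelling the
    phase shift; a 6-tap (fifth-order) FIR is a faithful but crude sketch."""
    taps = firwin(numtaps, band, pass_zero=False, fs=fs)
    return filtfilt(taps, [1.0], eeg, axis=-1)


# Decompose a toy 21-channel recording into mu and beta sub-bands
bands = [(8.0, 12.0), (13.0, 26.0)]
rng = np.random.default_rng(0)
eeg = rng.normal(size=(21, 1000))                # 4 s at 250 samples/s
subbands = [subband_filter(eeg, b) for b in bands]
```

Each filtered copy of the signal then serves as the input for one frequency precondition $B_m$.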

#### **2.2 Model Learning Algorithm**

Here, the channel selection and frequency selection models are combined into a two-tuple $\vartheta_m = (S_m, B_m)$ denoting a spatial-spectral precondition, and $\nu$ is the universal set of all these spatial-spectral preconditions. Lastly, the combination function can be computed as

$$F(E_{train}; \vartheta) = \sum_{\vartheta_m \in \vartheta} a_m f_m(E_{train}; \vartheta_m) \tag{4}$$

Hence, for each spatial-spectral precondition $\vartheta_m \in \nu$, the training dataset $E_{train}$ is filtered under $\vartheta_m$. The CSP features are obtained from the filtered training dataset, and the features of individual physiological nature are combined using the PROB method [1]. Let us denote the N features by random variables $X_i$, i = 1,…, N, with class labels $Y \in \{\pm 1\}$. An optimal classifier $f_i$ is defined for each feature i on the single feature space $D_i$, reducing the misclassification rate. Let $g_{i,y}$ denote the density of $X_i$ given $Y = y$ for each i and label y = +1 or −1. Then f is the optimal classifier on the combined feature space D = (D1, D2,…, DN), X is the combined random variable X = (X1, X2,…, XN), and the density of X given Y = y is $g_y$; hence, under the assumption of equal class priors, for $x = (x_1, x_2,…, x_N) \in D$,

$$f\_i(\mathbf{x}\_i; \boldsymbol{\gamma}(\theta\_i)) = 1 \leftrightarrow \hat{f}\_i(\mathbf{x}\_i; \boldsymbol{\gamma}(\theta\_i)) := \log \left( \frac{\mathbf{g}\_{i,1}(\mathbf{x}\_i)}{\mathbf{g}\_{i,-1}(\mathbf{x}\_i)} \right) > 0 \tag{5}$$

where $\gamma$ is the model parameter determined by $\vartheta_i$ and $E_{train}$. Incorporating independence between the features into the above equation results in the optimal decision function,

$$f(\mathbf{x}; \boldsymbol{\gamma}(\boldsymbol{\theta})) = 1 \leftrightarrow \hat{f}(\mathbf{x}; \boldsymbol{\gamma}(\boldsymbol{\theta})) = \sum_{i=1}^{N} \hat{f}_i(\mathbf{x}_i; \boldsymbol{\gamma}(\boldsymbol{\theta}_i)) > 0 \tag{6}$$

Here, the assumption is that, for each class, the features are Gaussian distributed with equal covariance, i.e., *Xi* | *Y* = y ∼ N(μi,y, Σi). Defining *wi* := Σi−1(μi,1 − μi,−1), the classifier becomes

$$f(\mathbf{x}; \boldsymbol{\gamma}(\boldsymbol{\theta})) = 1 \leftrightarrow \hat{f}(\mathbf{x}; \boldsymbol{\gamma}(\boldsymbol{\theta})) = \sum\_{i=1}^{N} \left[ \boldsymbol{w}\_{i}^{T} \mathbf{x}\_{i} - \frac{1}{2} \left( \boldsymbol{\mu}\_{i,1} + \boldsymbol{\mu}\_{i,-1} \right)^{T} \boldsymbol{w}\_{i} \right] > 0 \tag{7}$$
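Under this Gaussian assumption, Eqs. (5)–(7) amount to fitting a linear discriminant per feature block and summing the per-feature log-likelihood-ratio scores. A minimal sketch in Python/NumPy (the function names and interfaces are our own illustration, not the authors' code):

```python
import numpy as np

def fit_feature_lda(X, y):
    """Per-feature Gaussian LDA with shared covariance (Eqs. (5)-(7)).
    X: (n_samples, d) block for one feature X_i; y: labels in {+1, -1}."""
    mu_pos = X[y == 1].mean(axis=0)
    mu_neg = X[y == -1].mean(axis=0)
    # pooled within-class covariance Sigma_i
    Xc = np.vstack([X[y == 1] - mu_pos, X[y == -1] - mu_neg])
    sigma = Xc.T @ Xc / (len(X) - 2)
    w = np.linalg.solve(sigma, mu_pos - mu_neg)  # w_i = Sigma_i^-1 (mu_{i,1} - mu_{i,-1})
    b = -0.5 * (mu_pos + mu_neg) @ w             # threshold term of Eq. (7)
    return w, b

def prob_combine(feature_blocks, params):
    """Sum the per-feature log-likelihood-ratio scores and threshold at 0 (Eq. (6))."""
    score = sum(x @ w + b for x, (w, b) in zip(feature_blocks, params))
    return np.where(score > 0, 1, -1)
```

Each `(w, b)` pair implements one term of the sum in Eq. (7); the assumed independence between features is what allows the per-feature scores simply to be added.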

The obtained weak classifier can be rewritten as *fm*(*Etrain*; ϑm), which is trained using the boosting algorithm. Thus, the classification error defined earlier can be formulated as

$$\{a\_m, \vartheta\_m\}\_0^M = \arg\min\_{\{a,\, \vartheta\}\_0^M} \sum\_{i=1}^{N} L\left(y\_i, \sum\_{m=0}^{M} a\_m f\_m\left(\mathbf{x}\_i; \gamma(\vartheta\_m)\right)\right) \tag{8}$$

A greedy approach [8] is used to solve (8), as detailed below:

$$F\left(E\_{\text{train}}; \{a, \vartheta\}\_0^M\right) = \sum\_{m=0}^{M-1} a\_m f\_m\left(E\_{\text{train}}; \gamma(\vartheta\_m)\right) + a\_M f\_M\left(E\_{\text{train}}; \gamma(\vartheta\_M)\right) \tag{9}$$

Transforming the Eq. (9) into a simple recursion formula we get,

$$F\_m(E\_{\text{train}}) = F\_{m-1}\left(E\_{\text{train}}\right) + a\_m f\_m\left(E\_{\text{train}}; \gamma(\vartheta\_m)\right) \tag{10}$$

Supposing *Fm*−1(*Etrain*) is known, *fm* and *am* can be determined by

$$F\_m(E\_{\text{train}}) = F\_{m-1}\left(E\_{\text{train}}\right) + \arg\min\_f \sum\_{i=1}^N L\left(y\_i, \left[F\_{m-1}\left(\mathbf{x}\_i\right) + a\_m f\_m\left(\mathbf{x}\_i; \gamma(\vartheta\_m)\right)\right]\right) \tag{11}$$

The problem in (11) is solved using steepest gradient descent [9], and the pseudo-residuals are given by

$$r\_{\pi(i)m} = -\nabla\_F L(y\_{\pi(i)}, F(\mathbf{x}\_{\pi(i)})) = -\left[\frac{\partial L(y\_{\pi(i)}, F(\mathbf{x}\_{\pi(i)}))}{\partial F(\mathbf{x}\_{\pi(i)})}\right]\_{F(\mathbf{x}\_{\pi(i)}) = F\_{m-1}(\mathbf{x}\_{\pi(i)})} \tag{12}$$

Here, {π(i)}i=1N̂ denotes the first *N̂* elements of a random permutation of {i}i=1N. A new set {(xπ(i), rπ(i)m)}i=1N̂, which defines a stochastic, partly best descent step direction, is then produced and used to learn γ(ϑm):

$$\gamma(\vartheta\_m) = \arg\min\_{\gamma, \rho} \sum\_{i=1}^{\hat{N}} \left[ r\_{\pi(i)m} - \rho f\left(\mathbf{x}\_{\pi(i)}; \gamma(\vartheta\_m)\right) \right]^2 \tag{13}$$

The combination coefficient *am* is then obtained with γ(ϑm) as

$$a\_m = \arg\min\_a \sum\_{i=1}^N L\left(y\_i, \left[F\_{m-1}\left(\mathbf{x}\_i\right) + a f\_m\left(\mathbf{x}\_i; \gamma(\vartheta\_m)\right)\right]\right) \tag{14}$$

Here, each weak classifier *fm* is trained on a random subset {π(i)}i=1N̂ drawn without replacement from the full training dataset. This random subset, rather than the full sample, is used to fit the base learner as shown in Eq. (13), and the model update for the current iteration is computed using Eq. (14). During the iterations, a self-adjusted training data pool P is maintained in the background, as detailed in Algorithm 1: the number of copies of each incorrectly classified sample is computed from the local classification error, and these copies are added to the training data pool.
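The recursion of Eqs. (10)–(14) with subsampling is the standard stochastic gradient boosting loop. A minimal sketch under squared-error loss, with a generic weak learner standing in for the paper's CSP-based learners (all function names and interfaces are hypothetical):

```python
import numpy as np

def boost(X, y, weak_fit, weak_predict, M=50, subsample=0.9):
    """Stochastic gradient boosting recursion of Eqs. (10)-(14) under
    squared-error loss. weak_fit/weak_predict are generic stand-ins for
    the CSP-based weak learners of the paper."""
    N = len(X)
    n_hat = max(1, int(subsample * N))      # the N_hat/N fraction
    F = np.zeros(N)                         # F_0 = 0
    models = []
    rng = np.random.default_rng(0)
    for m in range(M):
        pi = rng.permutation(N)[:n_hat]     # {pi(i)}: sampling without replacement
        r = y[pi] - F[pi]                   # pseudo-residuals, Eq. (12)
        g = weak_fit(X[pi], r)              # least-squares fit, Eq. (13)
        h = weak_predict(g, X)
        # line search for a_m (Eq. (14)); closed form under squared loss
        a = (r @ h[pi]) / max(h[pi] @ h[pi], 1e-12)
        F = F + a * h                       # recursion, Eq. (10)
        models.append((a, g))
    return F, models
```

With, e.g., an ordinary least-squares weak learner (`weak_fit = lambda Xs, r: np.linalg.lstsq(Xs, r, rcond=None)[0]`, `weak_predict = lambda g, Xs: Xs @ g`), the loop drives the training residual toward zero.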

#### **2.3 Algorithm 1: Architecture of Proposed Boosting Algorithm**

Input: the EEG training dataset {(xi, yi)}i=1N, the squared-error loss function L(y, F(x)), the number of weak learners M, and ϑ, the set of all preconditions.

Output: F, the optimal combination classifier, and the weak learners {fm}Mm=1, where {αm}Mm=1 are the weights of the weak learners and {ϑm}Mm=1 are the preconditions under which these weak learners are learned.

For m = 1, …, M:

1. Randomly permute the current training data pool Pm−1, {π(i)}i=1|Pm−1| = randperm(i)i=1|Pm−1|, and keep the first N̂ samples.
2. Compute the pseudo-residuals (Eq. (12)), learn the weak learner fm under its precondition ϑm (Eq. (13)), and obtain its weight αm (Eq. (14)).
3. Update the model: Fm(Etrain) = Fm−1(Etrain) + αm fm(Etrain; γ(ϑm)).
4. Re-adjust the training data pool: add d copies (Eq. (15)) of the incorrectly classified samples to obtain Pm.

End for.


The iteration number M is determined with an early stopping strategy [23] to avoid overfitting. Using N̂ = N introduces no randomness, whereas a smaller N̂/N fraction incorporates more overall randomness into the process. In this work, N̂/N = 0.9 was used and a comparably satisfactory performance was obtained with this approximation. While adjusting P, the number of copies d of the incorrectly classified samples is computed from the local classification error e = |TFalse|/N as

$$d = \max\left(1, \left\lfloor \frac{1-e}{e+\epsilon} \right\rfloor\right) \tag{15}$$

Here, the parameter ϵ is called the accommodation coefficient. Since e is always less than 0.5 and decreases over the iterations, larger weights are placed on samples that were incorrectly classified by strong learners.
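The copy count of Eq. (15) and the pool re-adjustment can be sketched as follows (the default ϵ = 0.05 matches the value used later in the paper; the function names are our own):

```python
import math

def n_copies(e, eps=0.05):
    """Number of copies d of each misclassified sample, Eq. (15);
    eps is the accommodation coefficient."""
    return max(1, math.floor((1 - e) / (e + eps)))

def readjust_pool(pool, misclassified, e, eps=0.05):
    """Append d copies of the misclassified samples to the training pool."""
    return pool + misclassified * n_copies(e, eps)
```

For example, with e = 0.2 and ϵ = 0.05, d = ⌊0.8/0.25⌋ = 3, so every misclassified sample appears three more times in the pool; as e shrinks over iterations, d grows, weighting hard samples more heavily.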

### **3 Results**

The robustness of the designed algorithm was assessed on the BCI competition IV (IIa) dataset [2]. FastICA [15] was employed to remove artifacts caused by eye and muscle movements. To compare the performance and efficiency of the designed algorithm, Regularized CSP (RCSP) [13] was used for feature extraction; its model parameter λ was chosen on the training set using a hold-out validation procedure. For the four-class motor imagery classification task, a one-versus-rest (OVR) [21] strategy was employed for CSP. The PROB method [1], which incorporates independence between the ERD and LRP features, was used for feature combination. Feature selection was performed to retain only relevant features, since adding more features does not necessarily improve the training accuracy. Features were selected using the Fisher score (a variant, J = ‖μ+ − μ−‖² / (σ+ + σ−)) [10], which measures the discriminative power of each individual feature in the feature vector; the features with the largest Fisher scores were selected as the most discriminative. Linear Discriminant Analysis (LDA) [4], which minimizes the expected risk of misclassification, was used for classification.
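The Fisher-score selection can be sketched per feature as follows (we take σ± to be the per-class variances, one plausible reading of the variant; function names are our own):

```python
import numpy as np

def fisher_score(X, y):
    """Fisher score variant J = ||mu+ - mu-||^2 / (sigma+ + sigma-),
    computed per feature column; sigma is taken as the per-class variance."""
    Xp, Xn = X[y == 1], X[y == -1]
    num = (Xp.mean(axis=0) - Xn.mean(axis=0)) ** 2
    den = Xp.var(axis=0) + Xn.var(axis=0)
    return num / np.maximum(den, 1e-12)

def select_features(X, y, k):
    """Indices of the k features with the largest Fisher scores."""
    return np.argsort(fisher_score(X, y))[::-1][:k]
```

A feature whose class means are far apart relative to its within-class spread gets a large score and survives the selection.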

Here, the optimal channels identified using [20] for the four MI movements (left hand, right hand, foot, and tongue) were CP4, Cz, FC2, and C1, respectively. The 2-D topoplot maps of the peak amplitudes of the boosting-based CSSP filtered EEG in each electrode for subject S1 are shown in Fig. 2.

**Fig. 2.** 2-D topoplot maps of peak amplitude of Boosting based CSSP filtered EEG in each channel for subject S1 in BCI competition IV (II a) dataset.

To compute the spatial weight for each channel, the quantitative vector L = ΣSi∈S αi Si [17] was used, where Si are the channel sets and αi their weights. The spectral weights were computed as given in [12] and then projected onto the frequency bands. In addition, temporal information was also obtained and visualized. The training dataset is preprocessed under each spatial-spectral precondition ϑm ∈ ϑ, which results in a new dataset on which spatial filtering is performed using CSP to obtain the spatial patterns. The first two CSP components are then projected back onto the signal space, yielding the CSP-filtered signal Em. The peak amplitude PmCi is computed from Em for each channel Ci ∈ C and averaged over the set of preconditions ϑm ∈ ϑ as PCi = (1/|ϑ|) Σϑm∈ϑ αm PmCi, where αm is the weight of the mth precondition; the result is visualized as a 2-D topoplot map. From the topoplot, it can be observed that left-hand and right-hand movements resulted in activation over the right and left hemispheres of the brain, respectively, foot movement activated the central cortical area, and tongue movement showed activation in the motor cortex region.
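The weighted channel average PCi = (1/|ϑ|) Σ αm PmCi reduces to a small weighted sum over the precondition axis; a sketch (the array layout is our assumption):

```python
import numpy as np

def channel_peak_map(P, alpha):
    """Weighted average peak amplitude per channel over all preconditions.
    P[m, i] = P_{m,C_i} (peak amplitude of the CSP-filtered signal E_m at
    channel C_i); alpha[m] = weight of the m-th precondition.
    Returns P_{C_i} = (1/|theta|) * sum_m alpha_m * P_{m,C_i}."""
    P = np.asarray(P, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    return (alpha[:, None] * P).sum(axis=0) / len(alpha)
```

The resulting per-channel vector is what gets rendered as the 2-D topoplot.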

The classification results on the test dataset for the proposed method and the competing method, Regularized CSP (RCSP), are detailed as follows. For all subjects, the maximum number of boosting iterations M was set to 180, determined using the early stopping strategy to avoid overfitting, and ϵ was set to 0.05. The Cohen's kappa values for all 9 subjects in the BCI IV (IIa) dataset are shown in Fig. 3. On this dataset, the CSSBP outperformed the RCSP algorithm and showed the highest average Cohen's kappa value [3]. The kappa values also show that combining feature vectors in the RCSP algorithm produced a significant improvement in all subjects except S4 and S6.

**Fig. 3.** Cohen's kappa values for all the 9 subjects in BCI IV (II a) dataset, where A is RCSP, B is RCSP with combined feature vectors, C is Boosting based CSSP (CSSBP), and D is Boosting based CSSP (CSSBP) with combined feature vectors.

The proposed method further improved the kappa values compared with the above algorithms; moreover, CSSBP with combined feature vectors outperformed CSSBP with a single feature. Statistical analysis was carried out in IBM SPSS ver. 23 using a Mann-Whitney U test, which showed a significant difference between the designed method and the comparison methods. In all cases, the designed method was superior at a significance level of p < 0.05, as shown in Fig. 4.

**Fig. 4.** Boxplots of RCSP and Boosting Approach, where A is RCSP, B is RCSP with combined feature vectors, C is CSSBP, and D is CSSBP with combined feature vectors for BCI IV (IIa) dataset (p < 0.05).

### **4 Conclusion**

In this work, a boosting-based common spatial-spectral pattern (CSSBP) algorithm with feature combination has been designed for multichannel EEG classification. The channel and frequency configurations are divided into multiple spatial-spectral preconditions using a sliding window strategy, and weak learners are trained under these preconditions with a boosting approach. The motive is to select the channel groups and frequency bands that contribute most to the relevant neural activity. The results show that CSSBP clearly outperformed the method used for comparison. In addition, combining the widely used feature vectors, ERD and readiness potentials (RP), significantly improved the classification performance over single-feature CSSBP and increased robustness.

The PROB method, which incorporates independence between the ERD and LRP features, enhanced performance; it can also be used to better explore the neurophysiological mechanisms underlying brain activity. Combining features of different brain tasks in a feedback environment, where the subject is trying to adapt to the feedback scenario, might make the learning process complex and time consuming; this needs to be investigated further in future online BCI experiments.

**Acknowledgements.** The authors would like to thank Fraunhofer FIRST, Intelligent Data Analysis Group, and Campus Benjamin Franklin of the Charité - University Medicine Berlin (http://www.bbci.de/competition/iii), and the Institute for Knowledge Discovery (Laboratory of Brain-Computer Interfaces), Graz University of Technology (http://www.bbci.de/competition/iv), for providing the datasets online.

### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

### **LINEUp: List Navigation Using Edge Menu Enhancing Touch Based Menus for Mobile Platforms**

Rana Mohamed Eisa1(B) , Yassin El-Shanwany<sup>1</sup>, Yomna Abdelrahman<sup>2</sup>, and Wael Abouelsadat<sup>1</sup>

> <sup>1</sup> German University in Cairo, Cairo, Egypt *{*rana.monir,wael.abouelsaadat*}*@guc.edu.eg, yassin.el-shanwany@student.guc.edu.eg <sup>2</sup> University of Stuttgart, Stuttgart, Germany Yomna.abdelrahman@vis.uni-stuttgart.de

**Abstract.** Displaying and interacting with cascaded menus on mobile phones is challenging due to the limited screen real estate. In this paper, we propose the *Edge Menu*, a U-shaped layout displayed along the edges of the screen. Through the use of transparency and minimal screen space, the Edge Menu can be overlaid on top of existing items on the screen. We evaluated two versions of the Edge Menu, for flat lists and for nested menus, and compared their performance to the traditional Linear Menu. We conducted three studies, which revealed that the Edge Menu supports both single-handed and two-handed use and outperforms the regular Linear Menu: it is on average 38.5% faster for single-hand usage and 40% faster for dual-hand usage. The Edge Menu used with both hands is on average 7.4% faster than with a single hand. Finally, for nested menus, the Edge Menu was shown to be 22%–36% faster than Linear Menus.

**Keywords:** Cell phones *·* Edge Menus *·* Linear Menus *·* Nested Menus *·* Gestures *·* Mobile interaction *·* Menu techniques *·* Mobile phone menus

### **1 Introduction**

Mobile phones are used today to perform various functions and are no longer limited to making voice calls. Users manipulate images and videos, write documents, broadcast events, and even create and edit 3D models on mobile phones. The processing capabilities of some recent mobile phones are similar to those of laptops, which makes them suitable for performing almost any task. However, the limited screen real estate on mobile phones is the biggest obstacle to fully utilizing the underlying hardware and sophisticated software applications. There are currently over 1 billion smartphones worldwide [2]. Hence, it is no exaggeration to claim that navigating through lists is one of the most frequently performed daily tasks.

In this paper, we describe our work aiming to enhance menu navigation on mobile phones. We conducted three studies. In the first two, we explored one of the most regularly visited lists, the Contacts' List, since calling a previously stored phone number is one of the most commonly performed daily tasks. Although the current design of the Contacts' List in Android and iPhone seems adequate to most users, we believe it will soon be challenged by the rapidly increasing number of entries. As the current trend of merging social contacts with phone contacts in one list continues to spread, the average number of entries is expected to rise rapidly. A typical Internet user has about 600 social ties [16]. On Facebook, the mean number of friends among adult users is 338, and the median is 200 [1]. At this rate, Contacts' Lists with several hundred entries will gradually become the norm. At the moment, finding a contact can be done using speed dial, search by voice, search by typing a name, or using the menu. Each of these interaction techniques suits a specific context. For instance, while search by voice might be the fastest way to dial a contact, it requires the user to be in a relatively quiet environment.

Moreover, many software applications have complex features organized into deeply nested menu structures. This renders them unusable on mobile screens, where the limited screen size makes the display of such menus impossible.

In our third study, we developed the Edge Menu as a proposed solution to this problem. The Edge Menu displays each level of a Nested Menu on one side of the screen and the user alternates between left and right edges while navigating in the menu using symmetric bi-manual interaction.

In this research, we aim to enhance menu navigation through the following contributions:


### **2 Related Work**

Menu Navigation is still an open topic that has many usability issues that need more investigation and research. Our work builds on strands of prior work: (1) Menus Design, (2) List Navigation task, (3) Contacts' List usage and (4) Edge Screen.

### **2.1 Menus for Mobile**

Several researchers have developed menus which attempt to speed up selection in large sets of items presented on a small cellphone screen. Kim et al. [22] developed a 3D Menu which utilizes the depth cue. The researchers' formal evaluation reveals that as the number of items gets larger and the task complexity is increased, the 2D organization of items is better. For a small number of items, the 3D Menu yields a better performance. Foster and Foxcroft [13] developed the Barrel Menu which consists of three horizontally rotating menu levels stacked vertically; the top level represents the main menu items. Selecting an item from a level is achieved by rotating the level left or right, resulting in the child elements of the current item being displayed in the menu level below. Francone et al. [14] developed the Wavelet Menu, which expands upon the initial Marking Menus by Kurtenbach and Buxton [21]. Bonnet and Appert [7] proposed the Swiss Army Menu which merges standard widgets, such as a font dialog, into a Radial Menu layout. Zhao et al. [35] used an Eyes-Free Menu with touch input and reactive auditory feedback.

### **2.2 List Navigation**

Menus used in mobile phones are influenced by Linear Menus which were originally created for desktop graphical user interfaces (**GUI**). Such menus suit desktop environments, where large screen size can accommodate displaying more items. However, Linear Menus are not a good option for a mobile phone interface, as the screen is much smaller. Smartphone users are forced to do excessive scrolling to find an item in a Linear Menu since the screen can only display a handful of items at a time. Almost all menus are formatted in a linear manner, listing entries that are arranged from the top to the bottom of the screen. When presenting a list of items to the user, the available hardware and software have limited the computer system architecture to a linear format. Pull-Down Menus and Pop-Up Menus are a typical example of the linear arrangement. Most of these menus are either static on the screen or are activated from a specific mouse action [9].

#### **2.3 Contacts' List**

The Contacts' List has been the focus of several research works. Oulasvirta et al. [26] recommended augmenting each entry with contextual cues such as user location, availability of user, time spent in location, frequency of communication and physical proximity. Jung, Anttila and Blom [19] proposed three special category views: communication frequency, birthday date, and new contacts. This is meant to differentiate potentially important contacts from the rest. Bergman et al. [6] modified the Contacts' List to show unused contacts in smaller font at the bottom of the list. Plessas et al. [27] and Stefanis et al. [29] proposed using the call log data and a predictive algorithm for deciding which entries are most likely to be called at any specific time. Campbell et al. [24] utilized an EEG signal to identify the user choice. Ankolekar et al. [4] created Friendlee, an application which utilized call log and social connections to enable faster access to the sub-social-network reachable via mobile phone, in addition to contacts.

#### **2.4 Utilizing Screen Edge and Bezel**

Apple Macintosh was the first to utilize the screen edge by fixating the menu bar at the top edge of the screen. Wobbrock developed the Edge Keyboard [32,34], where the character buttons are placed around the screen's perimeter and can be stroked over or tapped like ordinary soft buttons. More recently, screen edge and bezel have attracted the attention of researchers seeking to enable richer interactions on mobile. Li and Fu [23] developed the BezelCursor, which is activated by performing a swipe from the bezel to the on-screen target. The BezelCursor supports swiping for command invocation as well as virtual pointing for target selection in one fluid action. Roth and Turner [28] utilized the iPhone bezel to create the Bezel Swipe: crossing a specific area at the screen edge towards the inside activates a particular functionality, and the user continues the gesture to complete the desired operation. Chen et al. [10] utilized the bezel to enhance copy and paste. Based on Fitts' Law, nearer, bigger targets are faster to reach than farther, smaller ones. Thus, the target's size is an important parameter to take into consideration, because the larger the target is, the faster, easier, and more efficient its selection is [11]. Jain and Balakrishnan [18] have proven the utility of bezel gestures in terms of efficiency and learnability. Hossain et al. [17] utilized the screen edges to display proxies of off-screen objects to facilitate their selection. Recently, Samsung introduced the *Samsung Galaxy Edge* series, mobile phones with 3D curved glass that follows the curves of the device [3]. This design has huge potential and further motivates our research, which seeks to show that the Edge Menu design is more usable than the regular Linear Menu.

In this work, we aim to evaluate the new Edge Menu design to enhance the navigation performance on smartphones. Namely we focus on three main research questions (**RQ**):


### **Menu Design and Interaction**

Our main goal was to enable the quick selection of an entry in a list and speed up the navigation in a Nested Menu. While in previous works, researchers redefined the layout of the menu list totally, our strategy is to preserve the linear organization of entries and focus on speeding up the interaction.

To achieve this, we designed three user studies; the first two comprised experiments focusing on Contacts' Lists. The main goal of any user is to speed up the selection of the target name. Thus, selecting the first letter of both the first name and the last name is the most efficient technique to narrow down the Contacts' List as quickly as possible. Although not all users store both the first and last name of a contact, the same technique is applicable to contacts stored as a single entry: in that case, the first two letters of the entry would be used in the search. This case is left to future studies.

For the third study, we ran an experiment to enhance search in nested menus in the same way we enhanced one-level menus. Although redesigning menus entirely might result in efficient interaction, our approach enables porting existing applications to mobile platforms with less effort. We formulated three guiding design goals:


Although users prefer single-hand interaction [20], two-handed input has proven quicker [22,25]. We anticipate that the overwhelming number of contacts might require the user to use two hands to reach the target entry faster. The second design goal was to minimize finger travel distance on the screen. Fitts' Law teaches that movement time increases with the distance to the target and decreases with the target's width [12]. The third design goal was to make use of the screen edges, since users' fingers are often located there while holding the phone. Walker and Smelcer [31] and Froehlich et al. [15] have shown that utilizing an edge as a stopping barrier improves target acquisition time.
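The Fitts' Law relationship behind the second design goal, in its common Shannon formulation MT = a + b log2(D/W + 1) for target distance D and width W, can be illustrated in a few lines (the constants a and b are illustrative, not measured values):

```python
import math

def fitts_mt(d, w, a=0.2, b=0.1):
    """Fitts' law, Shannon formulation: MT = a + b * log2(d/w + 1).
    Movement time grows with target distance d and shrinks with target
    width w; a and b are device-dependent constants (illustrative only)."""
    return a + b * math.log2(d / w + 1)
```

Doubling the distance or halving the width both raise the predicted movement time, which is why short travel distances and edge-anchored (effectively larger) targets pay off.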

Our design effort yielded a menu fitted to the edges which makes it easily reachable using single hand and two hands. Two variations were developed to support the design goals. Since performance difference could be attributed to more than one factor, we opted for implementing simpler designs supporting only a single design goal for comparison purposes. In this paper our focus is to investigate if single and multi-level Edge Menu designs will work better than Linear Menu designs, with Single hand and Dual hands.

#### **2.5 Layout Design**

**Linear Menu.** Since Android based phones already have a Linear Menu used in the Contacts' List application, we were interested in using it as a baseline and to investigate the difference in performance between the different designs, (see Fig. 1). We implemented the Linear Menu in our system following the same interaction style as offered by Android OS. To support selecting both the first name and the last name, we extended the selection mechanism to accept two letters instead of one. Thus the user would need to tap twice for the two first letters. It is worth noting that in Android 2.2, the Contacts' List had a feature to select both first and last names. The user would start by selecting the first letter of the first name, then continue by swiping the finger horizontally for

**Fig. 1.** Linear Menu with flicking support

**Fig. 2.** Edge Menu with flicking support

a short distance and next move vertically - either upward or downward - to select the first letter of the last name. Although this feature was dropped from later versions of Android, we felt it is more appropriate to utilize an interaction mechanism which supports selection of the first two letters to be comparable with our design.

**Edge Menu.** An Edge Menu consists of a U-shaped panel fitted to the left, right, and bottom edges of the screen (see Figs. 2 and 10). For the Contacts' List, the menu items are the alphabet letters; for the Nested Menu, they are the default menu icons. We decided not to use the upper edge since it is the furthest away from the user's fingers. For the first study, we used contacts with both first and last names, not first names only, to keep the study consistent; the latter case will be supported in future studies. The user taps on the first letter of the first name followed by a tap on the first letter of the last name, which narrows down the choices. Scrolling through the results is done by flicking up and down. This menu design was motivated by the first design goal, supporting both two-handed and single-handed interaction, and the third, using the screen edges.
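The two-letter narrowing step can be sketched as a simple filter (an illustrative helper, not the study software; it assumes contacts are stored as "First Last" strings):

```python
def filter_contacts(contacts, first_initial, last_initial):
    """Narrow a contacts list by the first letters of the first and last
    names, as in the Edge Menu two-tap interaction (illustrative helper)."""
    f, l = first_initial.lower(), last_initial.lower()
    return [c for c in contacts
            if c.split()[0].lower().startswith(f)
            and c.split()[-1].lower().startswith(l)]
```

For a list of a few hundred entries, two taps typically shrink the result set to a handful of names, which the user then reaches by flicking.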

#### **2.6 Interaction Design**

**Linear Menu with Wheel.** This menu consists of two components: a linear list of alphabet letters placed along the right edge of the screen, and a wheel for scrolling at the bottom (see Fig. 3). To select an entry, the user chooses the first letter of the first name and then the first letter of the last name from the menu. Next, the user scrolls through the narrowed-down results using the wheel. Holding the phone in one hand, the wheel lies where the user would rest the thumb. We speculated that the slowest part of the interaction is scrolling up and down to locate an entry: since the user is unaware of the exact location of the contact, flicking either overshoots or undershoots the desired entry. Tu et al. [30] compared flicking to radial scrolling and found that radial scrolling led to shorter movement time than flicking for larger target distances. However, it was not clear whether using the thumb is efficient, since Wobbrock et al. [33] reported that the index finger is generally faster. This menu design was motivated by the second design goal, minimizing finger travel distance, but focused on the interaction with the narrowed-down list.

**Fig. 3.** Linear Menu with radial control for scrolling

**Fig. 4.** Edge Menu with radial control for scrolling

**Edge Menu with Wheel.** This design is similar to the Edge Menu but augmented with a wheel for scrolling through the results list (see Fig. 4). After choosing the first letter, a wheel is displayed in proximity to the last position of the user's finger. The user scrolls through the list of contacts by moving the finger in a circular motion on the wheel - following the same interaction style as in the Linear Menu with wheel. Clockwise movement causes scrolling down and anti-clockwise movement signals scrolling up. The speed of the rotation governs how fast the scrolling of names occurs. The user does not have to maintain his finger within the wheel border as any radial movement above or close to it, activates the scrolling. Finally, the user taps on the desired contact. This menu design attempts to support the three stated design goals.
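The wheel interaction described above maps the finger's angular movement around the wheel centre to a scroll offset. A sketch of that mapping (the coordinate conventions and the rows-per-radian gain are our assumptions, not the study implementation):

```python
import math

def angular_delta(cx, cy, p0, p1):
    """Signed angle (radians) swept between two touch points around the
    wheel centre (cx, cy). With screen coordinates (y axis pointing down),
    a positive delta corresponds to clockwise motion, i.e. scrolling down."""
    a0 = math.atan2(p0[1] - cy, p0[0] - cx)
    a1 = math.atan2(p1[1] - cy, p1[0] - cx)
    d = a1 - a0
    # wrap into (-pi, pi] so crossing the +/-pi boundary does not jump
    while d <= -math.pi:
        d += 2 * math.pi
    while d > math.pi:
        d -= 2 * math.pi
    return d

def scroll_rows(delta, rows_per_radian=4.0):
    """Convert the swept angle into a row offset; rotation speed therefore
    governs scrolling speed (positive = scroll down)."""
    return int(delta * rows_per_radian)
```

Because only the angle around the centre matters, the finger need not stay exactly on the wheel border, matching the behaviour described above.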

#### **2.7 Pre Study: Observing Mobile Holding Position**

We observed people in public areas while holding their mobile phones, to identify the most common and comfortable holding position. Across the many people we observed, almost all held their phones in a position where the phone's back rests on the palm (see Fig. 5).

**Fig. 5.** Most habitual holding position of a cellphone

### **3 Study I: Evaluating Edge Menus Layout and Interaction Techniques**

To answer our research questions (**RQ**) and to test the hypothesis that using the Edge Menu instead of the Linear Menu improves user performance, we conducted three studies sequentially.

Our goal with the evaluation was to find which menu is most efficient while working with a large-size list. A secondary goal was to understand the importance of our design goals and decide which is most relevant for future design efforts.

#### **3.1 Design**

We applied a repeated-measures design, where all participants were exposed to all conditions. An application displaying the menus and measuring user performance was implemented. The study has two independent variables, specifically the menu type with four levels; *Linear Menu*, *Edge Menu*, *Linear Menu with Wheel* and *Edge Menu with Wheel*, and the list size with three levels; *201 entries*, *300 entries* and *600 entries*; and two dependent variables the mean execution time and error rate. The latter is defined as the percentage of trials with an incorrect selection of a target name. The mean execution time, is defined as the time between the display of a target name to the participant and the participant tapping on that name in the Contacts' List. The order of the conditions was counter-balanced to avoid any learning effects. The study time was around 60–120 min plus 3 min for the training trials.

### **3.2 Apparatus**

Our experimental setup consisted of a Samsung S3 device with a 4.8 in. (1280 *×* 720) display running Android 4.0.

### **3.3 Participants and Procedure**

We recruited 36 participants (18 female, 18 male) with an average age of 26 years (SD = 2.27) using university mailing lists. Four of the participants were left-handed. None had any previous experience using Edge Menus. After arriving in the lab and being welcomed, participants signed a consent form and received an explanation of the purpose of the study.

We divided the participants equally into three groups of 12. The first group was tested using 201 contacts, divided into 3 blocks of 67 trials each. Thus each participant performed 804 trials (4 menus *×* 67 trials *×* 3 blocks), for a total of *9,648* trials (12 participants *×* 804 trials).

The second group was tested using 300 contacts, divided into 3 blocks of 100 trials each. Thus each participant performed 1,200 trials (4 menus *×* 100 trials *×* 3 blocks), for a total of *14,400* trials (12 participants *×* 1,200 trials).

Finally, the third group was tested using 600 contacts, divided into 3 blocks of 200 trials each. Thus each participant performed 2,400 trials (4 menus *×* 200 trials *×* 3 blocks), for a total of *28,800* trials (12 participants *×* 2,400 trials).
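The trial counts for the three groups follow directly from the factorial design; as a quick arithmetic check (a sketch, not the authors' code):

```python
participants_per_group = 12
blocks = 3
menus = 4  # Linear, Edge, Linear with Wheel, Edge with Wheel

totals = {}
for contacts, trials_per_block in [(201, 67), (300, 100), (600, 200)]:
    per_participant = menus * trials_per_block * blocks
    totals[contacts] = participants_per_group * per_participant
# totals == {201: 9648, 300: 14400, 600: 28800}
```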

The target names were carefully selected to ensure that the user needed to navigate in the Contacts' List before reaching the required name. The alphabet was divided into three sets: the first contained names starting with letters A to I, the second names starting with letters J to Q, and the last names starting with letters R to Z (see Fig. 6). Each block contained an equal number of names from the three sets. Names were not repeated between blocks to avoid learning effects. A large Contacts' List size was chosen in this experiment to expose performance differences, since the user has to scroll through many target names.
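Partitioning target names into the three alphabet sets and drawing balanced blocks can be sketched as follows (illustrative only; the helper names and sample contacts are hypothetical):

```python
# The three alphabet sets used in the study: A-I, J-Q, R-Z.
SETS = [("A", "I"), ("J", "Q"), ("R", "Z")]

def name_set(name):
    """Return the index (0-2) of the alphabet set a contact name falls in."""
    initial = name[0].upper()
    for i, (lo, hi) in enumerate(SETS):
        if lo <= initial <= hi:
            return i
    raise ValueError(f"non-alphabetic initial: {name!r}")

def balanced_block(names, per_set):
    """Pick an equal number of target names from each alphabet set."""
    block = []
    for i in range(len(SETS)):
        pool = [n for n in names if name_set(n) == i]
        block.extend(pool[:per_set])
    return block
```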

In this experiment we asked the participants to use a single hand throughout: the user holds the mobile phone with one hand and operates both the Edge Menu and the Linear Menu with that same hand.

#### **3.4 Results**

We analyzed the mean execution time. Data from the practice trials was excluded from the analysis. A univariate repeated-measures ANOVA was carried out on the remaining data. A significant main effect was found for menu type. Mauchly's test indicated that the assumption of sphericity had been violated; therefore, degrees of freedom were corrected using Greenhouse-Geisser estimates of sphericity, F(2.79, 30.68) = 82.758, p < .0001. Post-hoc analyses were carried out to compare means for menu type. Four statistically significant groups were detected

**Fig. 6.** An explanation of each trial's arrangement

from the analysis, namely: Linear Menu, Linear Menu with Wheel, Edge Menu, and Edge Menu with Wheel. The performance of the Linear Menu and the Linear Menu with Wheel was similar, but both were statistically different from the two Edge Menu variants. The fastest performance was achieved with the Edge Menu (μ = 5.5 s, σ = 0.15), followed by the Edge Menu with Wheel (μ = 5.9 s,

**Fig. 7.** Study I: mean execution time


**Table 1.** Mean execution time for the four layouts using Single hand

σ = 0.21). Third came the Linear Menu (μ = 8.2 s, σ = 0.93), then the Linear Menu with Wheel (μ = 8.3 s, σ = 0.88) (refer to Table 1 and Fig. 7 for the results). Participants' response errors were very few (2%), with no significant difference between the menu types.

#### **3.5 Discussion**

In conclusion, the U-shaped Edge Menu yielded better results than the regular Linear Menu, regardless of the interaction technique used (circular or linear).

In addition, since the Edge Menu design spreads the letters along three sides of the screen (left, right, and bottom), the user can interact with either hand, or with both hands when the first letter of the first name and that of the last name reside on different sides, although such an allocation is not guaranteed. Consequently, half of the Contacts' List entries were names whose first letters resided on the same side, and the other half were names whose first letters resided on different sides. Therefore, in Study II we explore dual-hand interaction, as suggested by the participants' subjective responses (questionnaire).

#### **3.6 Post Study: Questionnaire**

It was important to collect the participants' subjective views of the design after finishing the first study and before doing any further research.

After the participants finished the experiment, a questionnaire was distributed among them. They were all satisfied with the experience and the options offered to them. However, the major comment we received was that participants would be more satisfied with the Edge Menu if they were able to use both hands while navigating. Based on this feedback, we carried out the second study, enabling the participants to use both hands while navigating through the list.

### **4 Study II: Dual vs. Single Handed Interaction**

Study I showed that the Edge Menu outperforms the Linear Menu under the same testing environment and conditions. The next step was to test whether the Edge Menu performs even better when both hands are used, since the menu items are distributed along both screen edges (see Fig. 8). In this experiment the user was asked to use both hands while trying the new Edge Menu design, with linear scrolling only. The circular scrolling technique was dropped in this study since it had not been shown to outperform the linear technique. We investigated different list lengths (different numbers of contacts) to ensure that our findings apply to most applications. The experiment design, apparatus, and task were similar to those of Study I.

#### **4.1 Design**

This study had two independent variables: the menu type, with two levels (*Linear Menu* and *Edge Menu*), and the list size, with three levels (*201 entries*, *300 entries*, and *600 entries*); and two dependent variables: the error rate and the mean execution time. The distribution of the blocks throughout the trial was the same as in the first study (see Fig. 6). In each trial, the participant was instructed to locate and press a specific contact name, simulating the typical interaction that occurs when calling a number.

#### **4.2 Participants and Procedure**

Similar to the first study, we recruited 36 participants (18 female, 18 male) with an average age of 25 years (SD = 2.24) using university mailing lists. Six of the participants were left-handed. None had any previous experience using Edge Menus. After the participants arrived in the lab and were welcomed, they signed a consent form and received an explanation of the purpose of the study.

We divided the participants equally into three groups of 12. The first group was tested using 201 contacts, divided into 3 blocks of 67 trials each. Thus each participant performed 402 trials (2 menus *×* 67 trials *×* 3 blocks), for a total of *4,824* trials (12 participants *×* 402 trials).

**Fig. 8.** Study setup - user while performing a trial

The second group was tested using 300 contacts, divided into 3 blocks of 100 trials each. Thus each participant performed 600 trials (2 menus *×* 100 trials *×* 3 blocks), for a total of *7,200* trials (12 participants *×* 600 trials).

Finally, the third group was tested using 600 contacts, divided into 3 blocks of 200 trials each. Thus each participant performed 1,200 trials (2 menus *×* 200 trials *×* 3 blocks), for a total of *14,400* trials (12 participants *×* 1,200 trials).

#### **4.3 Results**

A paired-samples t-test comparing the execution times of the Edge Menu and the Linear Menu was performed at each level of the Contacts' List size (201, 300, 600). For the 201-contacts level, the Edge Menu had a statistically significantly lower execution time (5.15 s) than the Linear Menu (7.75 s), t(11) = 4.083, p < .05.

**Table 2.** Mean execution time for the two layouts using Dual hands


Also, for the 300-contacts level, the Edge Menu had a statistically significantly lower execution time (5.11 s) than the Linear Menu (8.5 s), t(11) = 6.811, p < .05. Finally, for the 600-contacts level, the Edge Menu again had a statistically significantly lower execution time (5.6 s) than the Linear Menu (8.7 s), t(11) = 6.534, p < .05.
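A paired-samples t-test of this form can be reproduced in a few lines of Python. The per-participant times below are hypothetical placeholders for illustration (the paper reports only group means):

```python
import math

def paired_t(xs, ys):
    """Paired-samples t statistic: t = mean(d) / (sd(d) / sqrt(n)),
    where d are per-participant differences and df = n - 1."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Hypothetical execution times (seconds) for 12 participants
linear = [7.9, 7.5, 8.1, 7.2, 7.8, 8.0, 7.4, 7.7, 7.6, 8.2, 7.3, 7.9]
edge = [5.2, 5.0, 5.4, 4.9, 5.1, 5.3, 5.0, 5.2, 5.1, 5.5, 4.8, 5.2]
t = paired_t(linear, edge)
```

With 11 degrees of freedom, |t| > 2.201 is significant at the two-tailed .05 level.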

Results showed that for 201 contacts, the Edge Menu outperformed the Linear Menu by 33.54%; for 300 contacts, by 39.88%; and for 600 contacts, by 35.63%. Results also showed a slight improvement in user performance with the dual-handed Edge Menu over the single-handed Edge Menu. The average performance of the two menu types at the different list sizes (201, 300, and 600 contacts) was recorded (refer to Table 2 and Fig. 9 for the results).
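The percentages above appear to be the relative reduction in mean execution time, which can be checked directly from the reported means:

```python
def improvement(linear_s, edge_s):
    """Relative reduction in mean execution time, in percent."""
    return 100 * (linear_s - edge_s) / linear_s

# (Linear Menu, Edge Menu) mean times in seconds, as reported per list size
reported = {
    201: (7.75, 5.15),
    300: (8.50, 5.11),
    600: (8.70, 5.60),
}
gains = {c: round(improvement(lin, edg), 1) for c, (lin, edg) in reported.items()}
# gains == {201: 33.5, 300: 39.9, 600: 35.6}, matching the paper's figures
```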

#### **4.4 Discussion**

The second study showed that the Edge Menu, especially the dual-handed variant, outperforms the Linear Menu and is worth adopting and investigating further. This was an initial exploration of a

**Fig. 9.** Study II: mean execution time

limited use case, and we envision that the design could be extended to wider applications than the Contacts' List. Therefore, we investigated extending the U-shaped Edge Menu via Nested Menus to allow more content navigation and display.

# **5 Study III: Evaluating Nested U-Edge Menus**

In this study, we extended our design to include Nested Menus. Our goal with the evaluation was to find which menu is most efficient when navigating a nested structure. A secondary goal was to understand the importance of our design goals and decide which is most relevant for future design efforts.

#### **5.1 Design**

The goal of this study is to compare the performance of the Edge Menu to a standard Linear Menu, on mobile, in the case of navigating a Nested Menu structure. We measured two dependent variables: execution time and error rate. The latter is defined as the percentage of trials with an incorrect selection of an item. The execution time is defined as the time from the communication of a menu item to the participant until the participant taps on that target. There were two independent variables: *Menu-Type*, with two levels (*Linear Menu* and *Edge Menu*), and *Menu-Depth*, with four levels (*Depth-2*, *Depth-3*, *Depth-4*, and *Depth-5*), representing Nested Menus of different depths.

#### **5.2 Apparatus**

Our experimental setup consisted of a Samsung S3 device with a 4.8 in. (1280 *×* 720) display running Android 4.0.

**Fig. 10.** Nested Edge Menu. Each level of a Nested Menu is displayed on one side of the screen.

#### **5.3 Participants and Procedure**

Eleven unpaid university students, six males and five females, performed the experiment (age μ = 21.5).

In each trial, the participant was provided with a target menu item along with the path to follow to reach it. The participant's task was to navigate through the menu (see Fig. 10) and click on the specified menu item.
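A nested menu of this kind can be modeled as a tree, with a trial corresponding to a path through it. A minimal sketch (the item labels are hypothetical, not the study's stimuli):

```python
# Each level of the nested menu maps an item label to either a submenu
# (another dict) or a leaf item (None).
menu = {
    "Settings": {
        "Display": {
            "Brightness": None,
            "Wallpaper": None,
        },
        "Sound": {"Volume": None},
    },
    "Contacts": None,
}

def navigate(menu, path):
    """Follow a path of labels through the nested menu;
    return True if every step of the path exists."""
    node = menu
    for label in path:
        if not isinstance(node, dict) or label not in node:
            return False
        node = node[label]
    return True
```

A Depth-*k* trial then corresponds to a path of length *k* from the root to the target item.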

At the beginning of the experiment, the task was explained to the participant. Before using each of the two designs, an explanation of the menu and the interaction was provided and some practice trials were executed. We instructed participants to use a specific hand posture with each menu type. For the Edge Menu, the participant was asked to hold the phone with two hands and use the thumbs to select. For the Linear Menu, the participant held the phone with one hand and used the thumb of that hand to perform the interaction. The study duration was around 50 min.

The experiment was divided into 3 blocks, each having 20 trials. The total number of trials in the experiment was 11 participants *×* 2 menus *×* 4 depths *×* 20 trials *×* 3 blocks = 5,280 trials.

#### **5.4 Results**

The error rate was very small (less than 2%) and thus was not included in the analysis. Since we wanted to compare the performance of the Edge Menu and the Linear Menu at every nesting level, we conducted a paired-samples t-test on the execution times of the two menus at each Menu-Depth. For Depth-5, the Edge Menu had a statistically significantly lower execution time (3.88 s) than the Linear Menu (6.1 s), t(10) = 3.3, p < .05. Similar results were found

**Fig. 11.** Study III: mean execution time

**Table 3.** Mean execution time for the two layouts using Nested Menus


for Depth-4, where the Edge Menu's mean execution time was 3.77 s versus 4.95 s for the Linear Menu, t(10) = 2.9, p < .05. For Depth-3, the Edge Menu's mean was 3.4 s versus 4.42 s for the Linear Menu, t(10) = 3.4, p < .05. At Depth-2, there was no statistically significant difference between the two menus (refer to Table 3 and Fig. 11 for the results).

#### **5.5 Discussion**

In this experiment, the performance gain due to the Edge Menu was not the same at every menu depth. At Depth-5, the Edge Menu reduced execution time by 36%; at Depth-4, by 24%; and at Depth-3, by 22.6%. In conclusion, the gain in performance increases as the number of levels in the menu increases. We believe this is because at the first levels of the menu both designs require almost the same number of steps; as the user goes deeper, the cumulative step count from the beginning of the trial increases and the user needs to interact more. At that point, the difference between the Edge Menu and Linear Menu results becomes substantial.

# **6 Summary**

Across the three studies, our results revealed that the Edge Menu is faster and yields better performance than the Linear Menu. In the two variations of the Edge Menu, the user utilized both hands to simultaneously enter the first letters, an example of a symmetric bi-manual task [5,8]. Using two hands outperforms using a single hand because the time to reposition a single hand on the next target is eliminated. In the first experiment, with unified testing conditions (a single hand for both the Edge Menu and the Linear Menu), the Edge Menu outperformed the Linear Menu by 32.93%, and the Edge Menu with wheel outperformed the Linear Menu with wheel by 28.92%. In the second experiment, in an attempt to enhance the Edge Menu performance even further and match the most comfortable position for holding a mobile phone, the user was asked to use both hands while testing the Edge Menu. Results showed that for 201 contacts the Edge Menu outperformed the Linear Menu by 33.54%; for 300 contacts, by 39.88%; and for 600 contacts, by 35.63%. Interestingly, the results showed a slight improvement of the two-handed Edge Menu over the single-handed Edge Menu. In the third study, the Edge Menu showed a remarkable decrease in execution time: 36%, 24%, and 22.6% for Depth-5, Depth-4, and Depth-3, respectively. We believe that the size of the menu's icons contributed to the positive results demonstrated by the Edge Menu: spreading out the menu items across the edges of the screen gives more space to each item. Each icon activation area along the sides of the Edge Menu was 1.5x as large as the activation area in the Linear Menu. Our results agree with previous work showing that larger activation areas yield faster performance [11] (Fig. 12 and Table 4).

**Fig. 12.** Average results summary


**Table 4.** Summary of the 3 studies' results

# **7 Limitations and Future Work**

We encountered several limitations while designing the three studies, most of which were resolved during the experiments. The main challenge was supporting different list sizes, which we resolved in the second study by running the experiment with different Contacts' List sizes. Only a few limitations are left for future research. The main goal would be to create a platform that allows application designers to integrate or convert their work directly to the Edge Menu. We believe that the source code and research presented in this paper should be made available to other researchers in an open-source library, to help researchers add their own ideas.

# **8 Conclusion**

We developed the Edge Menu, a U-shaped menu fitted to the left, right, and bottom edges of a mobile screen. The Edge Menu outperformed the Linear Menu by 23% to 40%. However, further research is required to enable the Edge Menu to support larger item sets, for example, languages with larger alphabets. While our findings suggest that the two variations of the Edge Menu will yield better performance on larger lists, this still needs to be verified in a formal study. This work explored the practicality and feasibility of the Edge Menu design. Based on our user studies and experiments, the Edge Menu yields better performance than the regular Linear Menu. These results should encourage software developers and application designers to integrate the Edge Menu into their designs instead of the Linear Menu, and to explore the capabilities offered by this relatively new design.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Applications

### **Design Considerations for Self Paced Interactive Notes on Video Lectures - A Learner's Perspective and Enhancements of Learning Outcome**

Suman Deb1(B) , Anindya Pal<sup>1</sup>, and Paritosh Bhattacharya<sup>2</sup>

<sup>1</sup> Department of CSE, National Institute of Technology Agartala, Jirania, India sumandeb.cse@nita.ac.in, anindya2674@gmail.com

<sup>2</sup> Department of Mathematics, National Institute of Technology Agartala, Jirania, India

pari76@rediffmail.com

**Abstract.** Video lectures form a primary part of MOOC instruction-delivery design. They serve as gateways to draw students into the course. In going over these videos to accumulate knowledge, there is a high occurrence of cases [1] where the learner forgets some of the concepts taught and focuses instead on the minimum amount of knowledge needed to attempt the quizzes and pass. This is a step backward if we are concerned with giving the learner a learning outcome that bridges the gap between what the learner knew before the course and after its completion. To address this issue, we propose an interaction model that enables the learner to promptly take notes while the video is being viewed. The work contains a functional prototype of the application for taking personalized notes from MOOC content. The work [12] integrates content from several world-leading MOOC providers using application program interfaces (APIs): a customized interface module for searching courses from multiple MOOC providers, called COURSEEKA, and a personalized note-taking module, called MOOCbook. This paper focuses largely on a learner's perspective on video-based lectures and interaction, to find the enhancements in interaction and the longer retention of MOOC content.

**Keywords:** MOOC note *·* Self-paced learning *·* Personalized learning *·* MOOCbook *·* MOOC video interaction *·* Enhanced learning outcome

## **1 Introduction**

A MOOC is a model of delivering education that is, in varying degrees, massive, open, online, and, most importantly, a course [13,14]. Most MOOCs have a structure similar to their traditional online higher-education counterparts, in which students watch lectures online and offline, read material assigned to them, participate in online forums and discussions, and complete quizzes and tests on the course material. The online activities can be supplemented by local meet-ups among students who live near one another (blended learning) [3]. The primary form of information delivery in the MOOC format is video. One of the challenges faced by today's online learners is the need for an interface that enables them to take notes from the video lectures [2]. Traditional methods used thus far by the student community are time-consuming and cumbersome in terms of organization. This work is an attempt to address this issue, enabling the learner to focus more on the curriculum than on how to compile and access the materials later. As MOOC courses are accessed throughout the world, beyond any geographical region, they inherently trigger another level of interaction and understanding difficulty due to cultural and linguistic variation. In addition, variation in human learning plays a great role in graceful MOOC acceptance, learning pleasure, and learning outcome.

# **2 Problem Statement**

There is a significant concern over what learners end up learning compared with what the MOOC instruction designer intended [4,7]. Many fall into the trap of knowing just enough to pass the quizzes and course assessments, thus neglecting other concepts that the learner may have come across but since forgotten [8,18]. Learners who acknowledge this issue on their own tend to view the videos again and again until they feel they have substantial command over the topic being taught in those videos [6,9]. While this may be good practice, it takes an awful amount of time. Also, watching multiple video lectures on a specific topic may cause content to overlap, and the learner tends to forget [15] the previously viewed content [10,16]. Instead, an interface that lets the learner capture essential parts of the video in a form that enables them to revise the concepts later, on demand, would make sense. This work designs an integrated *MOOC taker's note book* that integrates content from various course providers in a personalized note interface [11]. This enables cross-referencing, transcript copying, still-frame capture, and personalized text notes. Taking notes is a manifestation of a conscious effort against people's natural tendency to forget things with time [19]. Lectures or handouts given in class by the instructor are the same for everyone, but people seem to remember more when they actively record what is happening on their own. There is, however, a flip side to digital note taking: people are more prone to taking notes verbatim, with every word of the document [5]. The trade-off between digital and conventional notes is discussed in the experiments presented in [17]. But despite these findings, modern-day challenges demand that one utilize one's time in the best possible way.

# **3 MOOCbook a Novel Model**

Since videos represent the most significant part of MOOCs, it follows that the note-taking process revolves around them. The length of the videos varies from provider to provider, typically ranging from 2–4 min (micro-lectures) to a maximum of 15 min. As a video progresses, there are certain checkpoints into which an instructor breaks a topic, and these checkpoints serve as the keynotes for the topic at hand. For example, a video about supervised machine learning would typically discuss the common examples in which it is used, then explain the algorithm employed, plot the points representing the features and interpret them, differentiate it from other machine learning algorithms, and finally conclude with the scenarios and advantages where the algorithm applies.

**Fig. 1.** MOOCbook workflow

These checkpoints, although important to the MOOC taker at that instant, seem to fade away when the next video starts. The MOOC taker is reluctant to commit them to memory and tends to remember only those parts needed to pass the quizzes. To address this issue, we propose a novel model whereby MOOC takers can take notes on the fly while they watch the course videos. The parts of the course the MOOC taker intends to note correspond to certain points in the video. It is assumed that the video is accompanied by an interactive transcript that scrolls and highlights what the instructor is saying at each moment of the video. During the video, there may be equations, diagrams, graphs, and example scenarios

that explain the topic from various perspectives. To take the corresponding notes by hand would require stopping the video, taking up a conventional notebook, and writing or drawing what is on the video screen at that instant. This would eat into the valuable time the MOOC taker has already invested. The proposed on-the-go note taking, performed while the MOOC taker watches the video, is a meta-description extraction using client-side scripting in the browser the learner is using to access the materials. The parts of the lecture that catch the learner's attention are simultaneously displayed in the transcript. A recurring script extracts the transcript along with the screen capture and adds the portions to the notebook on events initiated by the user. The learner saves a considerable amount of time that would otherwise be spent taking notes conventionally. The user can view the updated note in the browser itself, giving a better perspective of what has been learnt (Fig. 1).

### **4 Architectural Design**

#### **4.1 Design of the COURSEEKA Module**

As a starting point toward the goals set forth, an online interface has been developed to address the learner's first objective, i.e., the need to identify suitable courses that address their current learning objective from an array of courses listed by various course providers, namely Coursera, Udacity, and Udemy. edX


**Fig. 2.** Fuzzy closeness approximation algorithm in action, filtering a search across multiple MOOC providers simultaneously

**Fig. 3.** A retrieved course from (a) Udemy and (b) Coursera

had also been approached for their API on two occasions, but both requests were rejected. These course providers are fairly popular and have gained trust among learners as the MOOC movement took hold. They also have well-defined APIs that expose course-related information. The COURSEEKA interface is based on the architecture described in Fig. 5. The interface finds courses available from three course providers, namely coursera.org, udacity.com, and udemy.com, and combines their results into a single web page where a user can enter a course-specific search term matching their learning objective; the courses are then filtered accordingly (Figs. 2, 3, 4 and 5).

**Fig. 4.** MOOCbook multi-modal note generation

### **4.2 Modified Fuzzy Closeness Approximation Algorithm**

Existing interfaces for course search are based on matching keywords wholly. While this may seem a naive way to recommend courses to a learner based on a search term, our web application takes a learner-centric approach to producing search results that suit someone who is willing to manage their online MOOC curriculum in a very specific way. The search results are constrained to come from one of the major MOOC providers existing today (as noted above). Moreover, the search algorithm is based on a modified fuzzy string closeness approximation, which can infer what the MOOC learner is specifically searching for even when the query is only halfway typed, or less.
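The paper does not publish the modified algorithm itself; the prefix-aware fuzzy matching it describes can be sketched with Python's standard `difflib` (a sketch under our own assumptions; the course titles and threshold are hypothetical):

```python
from difflib import SequenceMatcher

def prefix_score(query, title):
    """Score how closely the query matches the beginning of a course title,
    so a half-typed query like 'machin' still finds 'Machine Learning'."""
    q = query.lower()
    head = title.lower()[: len(q)]
    return SequenceMatcher(None, q, head).ratio()

def search(query, titles, threshold=0.6):
    """Return course titles ranked by fuzzy closeness to the query."""
    scored = [(prefix_score(query, t), t) for t in titles]
    return [t for s, t in sorted(scored, reverse=True) if s >= threshold]

courses = ["Machine Learning", "Web Development", "Data Science"]
```

Comparing the query against a title prefix of the same length, rather than the whole title, is what makes partially typed queries score highly.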

# **5 Implementation**

### **5.1 Prototype Specifications**

The prototype is a web application that hosts a video with an interactive transcript and has control buttons to preview and append notes. The interface captures portions of the video's text content, i.e., the transcript, along with screen captures, previews them, and appends them to the notebook inside the web page itself. Finally, the user has the option to download the notebook thus formed. All of this happens using client-side scripting, which is relevant since time is of the essence while the user takes a note as the video plays. It also takes load off the servers hosting the massive amounts of MOOC data.

### **5.2 Prototype Demonstration**

An initial working prototype has been implemented which uses the three APIs combined and lists all the courses relevant to a learner's interest as they type a search query. The search results are then displayed centrally using the fuzzy string closeness approximation. As a working demo, the video course used is one of those featured in the first week of the Machine Learning course by Professor Andrew Ng of Stanford University, hosted by coursera.org, in which the instructor explains unsupervised learning. The distinguishable parts of the video are:

1. The difference between unsupervised and supervised learning (two graphs).
2. Applications of supervised learning (images depicting them).
3. Tackling a problem (the cocktail party problem) using unsupervised learning (an image depicting the scenario).
4. The cocktail party problem algorithm (code in Python).
5. A quiz with options to choose from.

These distinguishable parts are of concern to the learner when compiling a digital note about the video. The MOOCbook interface is equipped to take snapshots of these parts and scrape the transcripts of the relevant portions as and when the learner deems it necessary. Figure 6 shows a screen of the video captured for preview. The snapshot is taken using the video and the video's interactive-transcript JS libraries in tandem. If the preview is deemed good for


**Fig. 6.** MOOCbook GUI and interactions

**Fig. 7.** Analytics dashboard



**Fig. 8.** MOOCbook final note in MS Word

**Fig. 9.** Example clickstream data collected from Google Analytics

adding to the note, the user then proceeds accordingly. To capture the lecture discussion relevant to the note being compiled, we made use of the VTT file available with the video on Coursera. The VTT file contains timestamps along with text content, which is scraped using suitable JavaScript code and added to the note. Thus, the cocktail party problem algorithm now has a proposed problem, a solution with code, and the relevant transcripts, all in one note, viewable in the browser itself while the video is still playing. The note thus compiled is available for download to the client machine using the jQuery word-export plugin, written in JS. The final note file is an MS Word document (Figs. 7, 8 and 9).
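Extracting timestamped text from a WebVTT transcript is straightforward; a Python equivalent of the scraping step described above (the cue text is illustrative, and the regex assumes plain cues separated by blank lines, without cue identifiers or settings):

```python
import re

# start --> end timestamps on one line, cue text until a blank line or EOF
CUE = re.compile(
    r"(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})\n(.+?)(?:\n\n|\Z)",
    re.S,
)

def parse_vtt(vtt_text):
    """Extract (start, end, text) cues from a WebVTT transcript."""
    return [(s, e, t.strip()) for s, e, t in CUE.findall(vtt_text)]

sample = """WEBVTT

00:00:01.000 --> 00:00:04.000
In unsupervised learning, the data has no labels.

00:00:04.000 --> 00:00:08.500
The algorithm must find structure on its own.
"""
cues = parse_vtt(sample)
```

Each cue's text can then be appended to the note alongside the screen capture taken at that timestamp.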

# **6 Synthesis of Experiments and Result**

The system developed for taking notes from MOOCs, namely MOOCbook, was tested for effectiveness. Pretests were conducted before the actual experiment to establish a clear reference point of comparison between the treatment group and the control group. To investigate whether the proposed system effectively generates a learning outcome that lasts even after the video completes, post-tests were conducted with the two groups. The subject matter portrayed in the two videos featured in the system is an introduction to the two major varieties of machine learning algorithms. Both the treatment and control groups had a basic knowledge of what machine learning is about.

### **6.1 Evaluation Criterion**

The MOOC interfaces currently featured on platforms like Coursera, Udacity, etc. are designed to deliver content over multiple media formats. The primary format, video, is designed to be accompanied by in-video quizzes as well as separate assessment modules that assess the learner's comprehension. However, certain parts of a video are overlooked by a learner who impulsively follows the video just to complete the quizzes and assessments: the learner may have peeked at the questions beforehand and is accordingly inclined to extract the answers from the video. Such a learner skims portions of the video in order to find the answers and thus is not open to effective learning. The questions for understanding how the system enhances a learner's learning outcome have been identified as follows:


### **6.2 Methodology**

The participants of this experiment were sixth-semester undergraduate engineering students: 84 students in total, divided into a control group and a treatment group. Each group was shown two videos on the system developed. The control group saw only the videos, while the treatment group used the MOOCbook interface, which enabled them to take notes if necessary. Each participant was allotted 40 min for viewing the videos, whose combined length is (12.29 + 14.13) = 26.42 min. Throughout the duration of the videos featured in MOOCbook, all user activities were recorded with Google Analytics, which provides key insights into how users interact with the video player while watching. The data collected through Google Analytics is downloadable and hence forms our dataset of study; it consists of CSV files obtained individually from all 84 users of the experiment. The effectiveness of the MOOCbook interface was tested using an independent-samples t-test, which compares the means of two unrelated groups on the same continuous variable. In this case, it was used to determine whether the learning outcome of an undergraduate engineering student improves with the application of MOOCbook. Thus the independent variable here is "user of MOOCbook or not" (one group had the MOOCbook interface at their disposal and the other did not) and the dependent variable is the "learning outcome".
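The test protocol can be sketched as follows. The scores below are synthetic stand-ins for the per-user data exported from Google Analytics, and the group sizes of 42 assume an even split of the 84 participants (the paper does not state the split):

```python
import numpy as np
from scipy import stats

# Synthetic stand-in scores; in the study these come from the 84
# participants' CSV exports (42 per group assumed).
rng = np.random.default_rng(7)
moocbook_users = rng.normal(loc=14.0, scale=3.0, size=42)  # treatment group
video_only = rng.normal(loc=12.0, scale=3.0, size=42)      # control group

# Independent-samples t-test: the independent variable is "uses MOOCbook
# or not", the dependent variable is the continuous learning-outcome score.
t_stat, p_value = stats.ttest_ind(moocbook_users, video_only)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

At the 5% level used in the paper, the null hypothesis of equal group means is rejected when `p_value < 0.05`.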

**Assumptions.** As required by the independent t-test, compliance with its six assumptions is detailed below.


### **6.3 Instruments**

The various analytical processes aimed at answering the identified questions are listed here. Before the experiment, a short demonstration walked the participants through the MOOCbook interface so that they were familiar with the system. A questionnaire measuring MOOC awareness among the participants served as a pretest, and two post-tests, comprising an analysis of the clickstream events generated during the experiment and a quiz, were aimed at testing the effectiveness of the MOOCbook interface.

**Pretest.** The pretest was carried out before the participants were given access to the system. The two groups were surveyed about their MOOC awareness. A questionnaire specific to MOOC awareness was used in this regard.

### **Post Intervention Tests**

1. **Clickstream data analysis** - To address how the behavior of participants differs when the MOOCbook interface is provided, in terms of interaction with the video (questions 1–3 of the **Evaluation Criteria** section), the clickstream events of the video recorded on the Google Analytics server were analyzed.

2. **Learning outcome** - To answer the questions 4 and 5 enlisted in **Evaluation Criteria** section, a quiz was conducted with the participants and the results were evaluated.

#### **Null Hypotheses**


**Pretest Results.** Data in Fig. 10 shows that the treatment group's pre-test mean score was 7.07 (SD = 2.443) while the control group's was 6.88 (SD = 2.098). To compare the two groups, a two-tailed t-test was performed on the sample at the 5% level of significance. The findings are shown in Fig. 11. The mean difference between the treatment and control groups in terms of test scores is 0.190. The findings (Fig. 11) lead to the conclusion that there is no significant difference between the treatment and control groups prior to the experimental study. Both groups were found to share a common ground of knowledge when it comes to MOOCs and are thus suitable for the MOOCbook test scenario. Hence hypothesis H1 failed to be rejected (Fig. 12).
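Assuming an even split of 42 participants per group (the paper reports only the 84 total), the reported means and standard deviations reproduce the stated mean difference and a pooled t statistic far below the 5% critical value, consistent with H1 not being rejected:

```python
import math

# Reported pretest statistics; n = 42 per group is an assumption.
m_treat, sd_treat = 7.07, 2.443
m_ctrl, sd_ctrl = 6.88, 2.098
n = 42

mean_diff = m_treat - m_ctrl                     # reported as 0.190
pooled_var = ((n - 1) * sd_treat**2 + (n - 1) * sd_ctrl**2) / (2 * n - 2)
t_stat = mean_diff / math.sqrt(pooled_var * (2 / n))
print(round(mean_diff, 3), round(t_stat, 3))     # |t| well below ~1.99 (df = 82)
```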

**Fig. 11.** Independent samples test as pretest

**Fig. 12.** Normal distribution for pretest scores

**Post Test Results Clickstream Data Analysis.** Data in Fig. 13 summarizes the clickstream data obtained from the 84 participants. The control group generated analytics data only from video player interactions such as play, pause, and fullscreen, while the treatment group could additionally generate note-taking events such as Add Text To Note and Add Image To Note. For the purpose of analysis, only the clickstream data for video player interactions was used; the note-taking interactions were excluded from the post-test analysis. The normal distribution graph of the post-test-1 scores for the two groups is shown in Fig. 15. To analyze hypothesis H2, a two-tailed t-test was performed on the sample at the 5% level of significance. The mean difference between the treatment and control groups in terms of the number of events registered while watching the videos is −32.667. The findings of the independent samples test are depicted in Fig. 14. These findings lead to the conclusion that there is a significant difference between the treatment and control groups after the experimental study. The two groups interacted in very different ways when viewing the videos: the number of clickstream events was far higher for the control group without the notes system than for the treatment group with notes enabled. This leads to the conclusion that hypothesis H2 is false and does not hold.


**Fig. 13.** Post-test clickstream results of treatment and control group

**Fig. 14.** Independent samples test as posttest 1

**Fig. 15.** Normal distribution for post test 1 scores

**Learning Outcome.** Post-test 2 is a questionnaire that aims to find the learning outcome of the participants. Its questions are set from the content of the two videos hosted in the MOOCbook system. The control group once again lacked the note-taking functionality whereas the treatment group had the notes module enabled. The results, shown in Fig. 16, relate directly to how well the users comprehended the lessons in the videos, giving a direct measure of how much knowledge a learner can retain. To analyze hypothesis H3, a two-tailed t-test was performed on the sample at the 5% level of significance. The mean difference between the treatment and control groups in terms of the Qscores (scores obtained by the participants on the questionnaire) is −2.333. The findings of the independent samples test are depicted in Fig. 17. They lead to the conclusion that there is a significant difference between the treatment and control




**Fig. 17.** Independent samples test as post-test 2

group after the experimental study. The two groups had very different learning outcomes in terms of understanding the contents of the videos: the number of correct quiz answers was far higher for the treatment group with the notes system enabled than for the control group with notes disabled. This leads to the conclusion that hypothesis H3 is false and does not hold. Thus the notes module plays a significant part in making learners more aware of the lesson content. They are able to identify the key points made by the lecturer and form memory mappings of lesson checkpoints, which later help them retrieve, i.e. recall, those key points.

# **7 Conclusion**

This work is an attempt to address these issues by enabling the learner to focus more on the curriculum than on how to compile and later access the materials. A novel model, MOOCbook, was presented and a working prototype was demonstrated for this purpose. The results obtained provide insights into what people are looking for when it comes to enhancing their learning outcome. One major finding was the need for self-paced MOOC notes. The empirical experiments conducted and anecdotal responses have shown a significant improvement in engagement to complete a MOOC course as well as an enhancement in learning outcome. All the work has been done from a learner's perspective. The inclusion of this tool in MOOC providers' platforms will pave the way for enhanced digital learning in the future.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

### **Using Psycholinguistic Features for the Classification of Comprehenders from Summary Speech Transcripts**

Santosh Kumar Barnwal(B) and Uma Shanker Tiwary

Indian Institute of Information Technology, Allahabad, India iis2009002@gmail.com

**Abstract.** In education, some students lack language comprehension, language production, and language acquisition skills. In this paper we extract several psycholinguistic features, broadly grouped into lexical and morphological complexity, syntactic complexity, production units, syntactic pattern density, referential cohesion, connectives, amounts of coordination, amounts of subordination, LSA, word information, and readability, from students' summary speech transcripts. Using these Coh-Metrix features, comprehenders are classified into two groups: poor comprehenders and proficient comprehenders. We conclude that a computational model can be implemented using a reduced set of features and that the results can be used to help poor reading comprehenders improve their cognitive reading skills.

**Keywords:** Psycholinguistics · Natural language processing · Machine learning classification

### **1 Introduction**

Reading is a complex cognitive activity in which learners read texts to construct a meaningful understanding from verbal symbols, i.e. words and sentences; this process is called reading comprehension. In the reading process, three main factors - the learner's context knowledge, the information evoked by the text, and the reading circumstances - together construct a meaningful discourse. Previous research claims that, in academic environments, several reading and learning strategies, including intensive and extensive reading [2], spaced repetition [7], and top-down and bottom-up processes [1], play a vital role in developing students' comprehension skills.

**Intensive Reading:** This is the more common approach, in which learners read passages selected from the same text or various texts about the same subject. Here, content and linguistic forms are repeated, so learners get several chances to comprehend the meaning of the textual content. It is usually a classroom-based, teacher-centric approach in which students concentrate on the linguistic, grammatical, and semantic details of the text to retain in

c The Author(s) 2017 P. Horain et al. (Eds.): IHCI 2017, LNCS 10688, pp. 122–136, 2017. https://doi.org/10.1007/978-3-319-72038-8\_10

memory over a long period of time. Students read passages carefully and thoroughly, again and again, aiming to translate the text into a different language, learn the linguistic details in the text, answer comprehension questions (such as objective-type and multiple choice), or learn new vocabulary words. Some disadvantages are: (a) it is slow; (b) it needs careful reading of a small amount of difficult text; (c) it requires more attention on the language and its structure, including morphology, syntax, phonetics, and semantics, than on the text itself; (d) the text may bore students, since it was chosen by the teacher; and (e) because exercises and assessments are part of comprehension evaluation, students may read only to prepare for a test and not for pleasure.

**Extensive Reading:** On the other hand, extensive reading provides more enjoyment, as students read large quantities of content of their own interest, focus on understanding the main ideas rather than the language and its structure, skip unfamiliar and difficult words, and read for summary [12]. The main aim of extensive reading is to learn a foreign language through large amounts of reading, thereby building student confidence and enjoyment. Several research works claim that extensive reading helps students improve reading comprehension, increase reading speed, gain a greater understanding of second-language grammar conventions, improve second-language writing, and stay motivated to read at higher levels [10].

The findings of previous research suggest that both extensive and intensive reading approaches are beneficial, in one way or another, for improving students' reading comprehension skills.

**Psycholinguistic Factors:** Psycholinguistics is a branch of cognitive science that studies language comprehension, language production, and language acquisition. It tries to explain the ways in which language is represented and processed in the brain; for example, the cognitive processes responsible for generating a grammatical and meaningful sentence from vocabulary and grammatical structures, and the processes responsible for comprehending words, sentences, etc. The primary linguistic areas concerned are phonology, morphology, syntax, semantics, and pragmatics. In this field, researchers study a reader's capability to learn language, for example the different processes required to extract phonological, orthographic, morphological, and semantic information when reading a textual document.

A more recent development, Coh-Metrix [5], makes it possible to investigate the cohesion of the explicit text and the coherence of the mental representation of the text. It provides detailed analysis of language and cohesion features that are integral to cognitive reading processes such as decoding, syntactic parsing, and meaning construction.

### **2 Brief Description of Coh-Metrix Measures**

Coh-Metrix is an automated text analysis tool that carries traditional theories of reading and comprehension to the next level and can therefore play an important role in different areas of education such as teaching, readability assessment, and learning. The tool analyses and measures features of texts written in English through hundreds of measures, all informed by previous research in disciplines such as computational linguistics, psycholinguistics, discourse processing, and cognitive science. It integrates several computational linguistics components, including lexicons, pattern classifiers, part-of-speech taggers, syntactic parsers, semantic interpreters, WordNet, the CELEX corpus, etc. Employing these elements, Coh-Metrix can analyze texts on multiple levels of cohesion, including co-referential cohesion, causal cohesion, density of connectives, latent semantic analysis metrics, and syntactic complexity [5].

All measures of the tool have been categorized into the following broad groups:


The aim of the present work is to identify linguistic features that can classify students, from their summary speech transcripts, into two groups: students with proficient comprehension skills and students with poor comprehension skills.

# **3 Participants and Method**

The participants, materials, and procedure used in this study are briefly described here.

**Participants:** Twenty undergraduate students (mean age (SD): 21.4 (0.86)) majoring in information technology participated in the experimental sessions; they studied in the same batch and performed all academic activities in English, although their primary languages differed. Students were told that they would be awarded some course credit for participating in the research. Based on their academic performance in the last four semesters, the students were divided into two groups: ten as proficient comprehenders and ten as poor comprehenders.

**Materials:** The reading materials consisted of two passages. One passage (38 sentences, 686 words, mean sentence length 18.0, Flesch-Kincaid Grade Level 13.3) was selected from the students' course book, whereas the other was a simple, interesting story (42 sentences, 716 words, mean sentence length 17.0, Flesch-Kincaid Grade Level 3.9). Both passages were written in English and were unread until the experiment began. Reading the story passage simulated an extensive reading experience, and reading the course passage simulated an intensive reading experience.
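For reference, the Flesch-Kincaid Grade Level is computed from word, sentence, and syllable counts. The syllable count is not reported in the paper; the value of roughly 1,270 below is back-computed so that the formula reproduces the course passage's reported grade of 13.3:

```python
def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch-Kincaid Grade Level formula."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Course passage counts from the paper: 686 words, 38 sentences.
# ~1270 syllables is an assumption (back-computed, not reported).
grade = flesch_kincaid_grade(686, 38, 1270)
print(round(grade, 1))  # reproduces the reported 13.3
```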

**Procedure:** All experimental sessions were held in a research lab in groups of 5 students. The experiment consisted of two tests. In each test, students were instructed to read a given passage, then solve a puzzle, and lastly tell the summary in as much detail as they could. Both tests were identical except for the reading material: the story passage was given in the first test and the course passage in the second. Students were asked to read the passage on a computer screen as they would normally read. The speech was recorded using digital audio recorder software installed on the computer. The puzzle task served to erase students' short-term memory of the read text, ensuring that the summary came from their long-term memory.

### **4 Feature Analysis**

**Feature Extraction:** The recorded audio files were transcribed in English; brief pauses were marked with commas, while long pauses were marked with full stops (end of sentence) where their placement was consistent with semantic, syntactic, and prosodic features. Repetitions, incomplete words, and incomprehensible words were not included in the transcription. In the experiment, two sets of transcripts were generated: (a) **story transcripts**, the texts of the story summary audio files, and (b) **course transcripts**, the texts of the course summary audio files. Both sets had twenty texts: ten from proficient comprehenders' audio files and ten from poor comprehenders'.

To analyse the texts of both sets of transcripts, we used the computational tool Coh-Metrix. Coh-Metrix 3.0 (http://cohmetrix.com) provides 106 measures, categorized into the eleven groups described in Sect. 2.

**Feature Selection:** In machine learning, including too many features may overfit a classifier, resulting in poor generalization to new data. Hence, only the necessary features should be selected to train classifiers.

We applied two different approaches to select the features needed to improve the accuracy of the classifiers.

*Approach-1:* Coh-Metrix provides more than a hundred measures of text characteristics, several of which are highly correlated. For example, Pearson correlations showed that the *z score of narrativity* was highly correlated (*r* = 0*.*911, *p <* 0*.*001) with the *percentile of narrativity*. From the 106 measures of the tool, variables were selected in two steps. First, all variables with high correlations with other variables (|*r*| ≥ 0*.*80) were discarded to handle the problem of collinearity. After removing these redundant variables, the feature set of the story transcripts had 65 measures whereas that of the course transcripts had 67. In Table 1, superscripts 1, 2 and 3 indicate measures present only in the story transcripts, only in the course transcripts, and in both, respectively. Thus, in the first step, measures with superscripts 1 and 3 were selected for classifying story transcripts, and measures with superscripts 2 and 3 for classifying course transcripts. In the second step, only measures present in both feature sets were retained, leaving the 52 common measures indicated with superscript 3 in Table 1 for the classifications.
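The collinearity filter of Approach-1 can be sketched as follows. This is a greedy variant using pandas, a minimal sketch under stated assumptions; the paper's actual pruning order may differ:

```python
import numpy as np
import pandas as pd

def drop_collinear(df: pd.DataFrame, threshold: float = 0.80) -> pd.DataFrame:
    """Drop every feature whose |Pearson r| with an earlier-kept feature
    reaches the threshold (one greedy pass over the correlation matrix)."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each feature pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] >= threshold).any()]
    return df.drop(columns=to_drop)

# Toy example: "b" is perfectly correlated with "a" and is dropped.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 1, 3, 2]})
print(list(drop_collinear(df).columns))
```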

**Pairwise Comparisons:** Pairwise comparisons were conducted to examine differences between proficient and poor comprehenders' texts in both sets of transcripts (story and course). The results are reported below.

1. Descriptive measures: Coh-Metrix provided eleven descriptive measures, of which six were selected as features. Paragraph count, paragraph length, sentence length, and word length showed a significant difference between


**Table 1.** A comparison of proficient and poor comprehenders' transcripts features. Values shown are mean (standard deviation).

(*continued*)


**Table 1.** (*continued*)

(*continued*)


#### **Table 1.** (*continued*)

(*continued*)


#### **Table 1.** (*continued*)

proficient comprehenders' text and poor comprehenders' text of both sets of transcripts.


significant. Poor comprehenders' transcripts had a comparatively greater proportion of pronouns than those of proficient comprehenders.

11. Readability: The tool provided three readability measures, of which one was selected as a feature. Flesch Reading Ease showed a significant difference between proficient comprehenders' text and poor comprehenders' text in both sets of transcripts.


**Table 2.** A comparison of proficient and poor comprehenders' features extracted from story transcripts. Values shown are mean (standard deviation).

*Approach-2:* In this approach, we selected appropriate features from all 106 Coh-Metrix measures by applying Welch's two-tailed, unpaired t-test to each measure across both types of comprehenders' transcripts. All features significant at p *<* 0.05 were selected for classification. Thus, the feature set of the story transcripts had 15 measures (Table 2) whereas that of the course transcripts had 14 measures (Table 3).
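The Approach-2 selection step can be sketched with SciPy's Welch test (`equal_var=False`). The feature names and values below are hypothetical illustrations, not the paper's data:

```python
from scipy import stats

def select_by_welch(group_a: dict, group_b: dict, alpha: float = 0.05) -> list:
    """Keep features whose Welch (unequal-variance) two-tailed t-test
    between the two groups is significant at the given alpha."""
    selected = []
    for name in group_a:
        _, p = stats.ttest_ind(group_a[name], group_b[name], equal_var=False)
        if p < alpha:
            selected.append(name)
    return selected

# Toy feature values (hypothetical): one separating feature, one not.
proficient = {"word_len": [5.1, 5.3, 5.0, 5.2, 5.4, 5.1, 5.2, 5.3, 5.0, 5.2],
              "noise":    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}
poor = {"word_len": [4.0, 4.2, 4.1, 3.9, 4.0, 4.1, 4.2, 4.0, 3.9, 4.1],
        "noise":    [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]}
print(select_by_welch(proficient, poor))  # only "word_len" separates the groups
```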


**Table 3.** A comparison of proficient and poor comprehenders' features extracted from course transcripts. Values shown are mean (standard deviation).

# **5 Classification**

We examined several classification methods, namely Decision Trees, Multi-Layer Perceptron, Naïve Bayes, and Logistic Regression, using the Weka toolkit [6]. 10-fold cross-validation was applied to train these classifiers. The results are reported in Table 4 in terms of classification accuracy and root mean square error (RMSE). Classification accuracy refers to the percentage of samples in the test dataset that are correctly classified (true positives plus true negatives). RMSE is frequently used as a measure of the differences between the values predicted by a classifier and the expected values; in this experiment, it gives the mean difference between the predicted and expected comprehension levels of the students. The baseline accuracy is what would be achieved by assigning every sample to the larger of the two classes. In this experiment, both classes had 10 training samples, so the baseline accuracy for poor vs. proficient comprehenders' transcripts is obtained by assigning all samples to either group, giving 0.5 (10/20 = 0.5).
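The paper runs this protocol in Weka; the same 10-fold cross-validation with accuracy, RMSE, and majority-class baseline can be sketched in scikit-learn. The feature matrix below is synthetic (20 samples × 5 features standing in for the selected Coh-Metrix measures):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the 20 transcripts x selected Coh-Metrix features.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (10, 5)),   # 10 poor comprehenders
               rng.normal(2.0, 1.0, (10, 5))])  # 10 proficient comprehenders
y = np.array([0] * 10 + [1] * 10)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
pred = cross_val_predict(LogisticRegression(), X, y, cv=cv)

accuracy = accuracy_score(y, pred)
rmse = float(np.sqrt(np.mean((pred - y) ** 2)))
baseline = max(np.bincount(y)) / len(y)         # 10/20 = 0.5 with equal classes
print(accuracy, rmse, baseline)
```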

# **6 Result and Discussion**

Table 4 shows the accuracies for classifying poor vs. proficient comprehenders' transcripts. The classifier accuracies were lower for approach-1 than for approach-2; however, they were at or above the baseline for all four classifiers. Also, the common features provided better accuracies as compared to the first


**Table 4.** Accuracies for the four classifiers.

step features (story or course feature set). In this experiment, the reduced set of features applied in approach-2 provided the best results for all four classifiers. However, it was observed that the features selected by approach-2 depended on the participants involved in the experiment as well as on the text read, whereas the features of approach-1 were largely robust to these changes. The major finding of this study is that three cohesion indices (lexical diversity, connectives, and word information), common to both Tables 2 and 3, played a vital role in classifying both types of transcripts. The logistic regression classifier classified story transcripts and course transcripts with accuracies of 100% and 80%, respectively.

Generally, on a first reading of a new text, a science or technology course text does not help most students develop a mental model representing the collective conceptual relations between the scientific concepts, owing to their lack of prior domain knowledge. In contrast, story texts carry general schemas, such as names, specific places, and the chronological details of an event; these schemas help students develop a mental model by integrating the specific attributes of the event described in the story [11]. Therefore, students stored the mental model of the story text in comparatively more detail in memory than that of the course text, which was reflected in their transcripts: both proficient and poor students' story transcripts contained more noun phrases than their course transcripts.

Poor comprehenders may not benefit as much as good comprehenders from reading a complex text, because grammatical and lexical linking within the text increases text length, density, and complexity. As a consequence, reading such a text involves creating and processing a more complex mental model. Comprehenders with low working-memory capacity experience numerous constraints on the processing of these larger mental models, resulting in lower comprehension and recall performance [8]. As a result, poor comprehenders' transcripts consist of comparatively more sentences with mixed content, reflecting the confused state of their mental models. Therefore, as shown in Table 1, the values of the situation model measures were higher in poor comprehenders' transcripts than in proficient comprehenders'.

The findings of this study also validate a previous study [3], which demonstrated that less-skilled comprehenders produced narratives that were poor in terms of both structural coherence and referential cohesion.

In short, the Coh-Metrix analysis of transcripts yields a number of linguistic properties of comprehenders' narrative speech. Comprehension proficiency was characterized by greater cohesion, shorter sentences, more connectives, greater lexical diversity, and more sophisticated vocabulary. We observed that lexical diversity, word information, LSA, syntactic pattern density, and sentence length provided the most predictive information for distinguishing proficient from poor comprehenders.

In conclusion, the current study supports using Coh-Metrix features to measure a comprehender's ability.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

### **LECTOR: Towards Reengaging Students in the Educational Process Inside Smart Classrooms**

Maria Korozi1(✉) , Asterios Leonidis<sup>1</sup> , Margherita Antona1 , and Constantine Stephanidis1,2

<sup>1</sup> Foundation for Research and Technology – Hellas (FORTH), Institute of Computer Science (ICS), Heraklion, Greece {korozi,leonidis,antona,cs}@ics.forth.gr <sup>2</sup> Department of Computer Science, University of Crete, Heraklion, Greece

**Abstract.** This paper presents LECTOR, a system that helps educators in understanding when students have stopped paying attention to the educational process and assists them in reengaging the students in the current learning activity. LECTOR aims to take advantage of the ambient facilities that "smart classrooms" have to offer by (i) enabling educators to employ their preferred attention monitoring strategies (including any well-established activity recognition techniques) in order to identify inattentive behaviors and (ii) recommending interventions for motivating distracted students when deemed necessary. Furthermore, LECTOR offers an educator-friendly design studio that enables teachers to create or modify the rules that trigger "inattention alarms", as well as tailor the intervention mechanism to the needs of their course by modifying the respective rules. This paper presents the rationale behind the design of LECTOR and outlines its key features and facilities.

**Keywords:** Smart classroom · Attention monitoring · Ambient intelligence

### **1 Introduction**

In the recent past there has been growing interest in how Information and Communication Technologies (ICTs) can improve the efficiency and effectiveness of education; it is acknowledged that, when used appropriately, they are potentially powerful tools for advancing or even reshaping the educational process. In more detail, ICTs are claimed to help expand access to information and raise educational quality by, among other things, helping make learning and teaching a more engaging, active process connected to real life [27]. Learning with the use of ICTs has been strongly related to concepts such as distance learning [4], educational games [7], intelligent tutoring systems, and e-learning applications [5]. Additionally, the notion of "smart classrooms", where activities are enhanced and augmented through the use of pervasive and mobile computing, sensor networks, artificial intelligence, etc. [6], has become prevalent in the past decade [30].

However, despite the fact that the educational process is continuously enriched with engaging activities, it is almost inevitable that students will get distracted either by internal stimuli (e.g., thoughts and attempts to retrieve information from memory) or external stimuli from the physical (e.g., visuals, sounds) or digital environment (e.g., irrelevant applications); hence, they might not always be "present" to take advantage of all the benefits that a "smart classroom" has to offer. This observation highlights the need for a mechanism that monitors the learners and when necessary, intervenes to appropriately reset attention levels.

The proposed system, named LECTOR, aims to take advantage of the ambient facilities that "smart classrooms" have to offer and enable educators to employ their preferred attention monitoring strategies (including any well-established activity recognition techniques) in order to identify inattentive behaviors and assist the educator in reengaging the students in the current learning activity. In more detail, the main contributions of this work are listed below:


### **2 Background Theory**

Attention is very often considered a fundamental prerequisite of learning, both within and outside the classroom environment, since it plays a critical role in issues of motivation and engagement [20]. However, as passive listeners, people generally find it difficult to maintain a constant level of attention over extended periods of time, while pedagogical research reveals that attention lapses are inevitable during a lecture. McKeachie [16] suggests that student attention will drift during a passive lecture unless interactive strategies are used. According to [31], student concentration decays during a passive lecture in the same way as that of a human operator monitoring automated equipment, with serious implications for learning and performance. Obtaining and maintaining the students' attention is an important task in classroom management, and educators apply various techniques for this purpose; however, currently no technological support is available to assist educators in monitoring students' behavior in the classroom and maximizing students' engagement in the task at hand. According to Packard [19], "classroom attention" refers to a complex and fluctuating set of stimulus-response relationships involving curriculum materials, instructions from the teacher and some prerequisite student behaviors (e.g., looking, listening, being quiet, etc.). Such behaviors can be rigorously classified as "appropriate" and "inappropriate" [26]. Appropriate behaviors include attending to the teacher, raising a hand and waiting for the teacher to respond, working at one's seat on a workbook, following text reading, etc., while inappropriate behaviors include (but are not limited to) getting out of one's seat, tapping feet, rattling papers, carrying on a conversation with other students, singing, laughing, turning one's head or body toward another person, showing objects, or looking at another classmate.
Some of the above behaviors would in fact be disruptive to some educational activities. However, students should not be forced to spend their whole day not being children, but being quiet, docile, and obedient "young adults" [29]. On the contrary, learning can be more effective if students' curiosity, along with their desire to think or act for themselves, remains intact.

Attention-aware systems have much to contribute to educational research and practice. These systems can influence the delivery of instructional materials, the acquisition of such materials from presentations (as a function of focused attention), the evaluation of student performance, and the assessment of learning methodologies (e.g., traditional teaching, active learning techniques, etc.) [20]. However, existing approaches [3, 17, 22, 23, 28] concentrate mainly on computer-driven educational activities. This work broadens the perspective by employing attention monitoring in a real classroom and incorporating a mechanism for suggesting improvements to the learning process; most importantly though, it empowers educators to customize or even create from scratch new inattention detection rules (e.g., "*if the students whisper while the educator is writing on the whiteboard…*") and intervention strategies.

### **3 The Smart Classroom Behind LECTOR**

LECTOR is employed inside a technologically augmented classroom where educational activities are enhanced with the use of pervasive and mobile computing, sensor networks, artificial intelligence, multimedia computing, middleware and agent-based software [1, 13, 24]. In more detail, the hardware infrastructure includes both commercial and custom-made artifacts, which are embedded in traditional classroom equipment and furniture. For example, the classroom contains a commercial touch-sensitive interactive whiteboard, technologically augmented student desks [21] that integrate various sensors (e.g., eye-tracker, camera, microphone, etc.), a personal workstation and a smart watch for the teacher, as well as various ambient facilities appropriate for monitoring the overall environment and the learners' actions (e.g., microphones, user-tracking devices, etc.).

The software architecture (Fig. 1b) of the smart classroom follows a stack-based model where the first layer, namely the AmI-Solertis middleware infrastructure [15], is responsible for (i) the collection, analysis and storage of the metadata regarding the environment's artifacts and (ii) their deployment, execution and monitoring in AmI-Solertis-enabled systems to formulate a ubiquitous ecosystem. The next three layers, namely the ClassMATE, CognitOS and LECTOR frameworks, expose the core libraries, and finally the remaining layer contains the educational applications. Specifically, ClassMATE [14] is an integrated architecture for pervasive computing environments that monitors the ambient environment and makes context-aware decisions; it features a sophisticated, unobtrusive profiling mechanism in order to provide user-related data to the classroom's services and applications. Furthermore, CognitOS [18] delivers a sophisticated environment for hosting educational applications, able to present the interventions dictated by LECTOR.

**Fig. 1.** (**a**) LECTOR's SENSE-THINK-ACT – LEARN model. (**b)** The software architecture of the smart classroom

### **4 LECTOR Approach**

LECTOR introduces a non-invasive multimodal solution, which exploits the potential of ambient intelligence technologies to observe student actions (SENSE), provides a framework to employ activity recognition techniques for identifying whether these actions signify inattentive behavior (THINK) and intervenes, when necessary, by suggesting appropriate methods for recapturing attention (ACT). According to cognitive psychology, the sense-think-act cycle stems from the processing nature of human beings, who receive input from the environment (perception), process that information (thinking), and act upon the decision reached (behavior). This pattern became the basis for many design principles regarding autonomous agents and traditional AI.

For that to be optimally achieved, the proposed system is able to make informed decisions using volatile information and reliable knowledge regarding the syllabus covered so far, the nature of the current activity, the "expected" behavior of the involved individuals towards it, the behavior of their peers, etc. The aforementioned pieces of information can be classified under the broader term Context of Use, defined as follows: "Any information that can be used to characterize the situation of entities (i.e., whether a person, place, or object) that are considered relevant to the interaction between a user and an application, including the user and the application themselves. Context is typically the location, identity, and state of people, groups, and computational and physical objects" [8]. Based on the above, the SENSE-THINK-ACT model of LECTOR relies on an extensible modeling component to collect and expose such classroom-specific information.

This work extends the SENSE-THINK-ACT model by introducing the notion of LEARN (Fig. 1a). The fact that the nature of this system enables continuous observation of student activities creates the foundation for a mechanism that provides updated knowledge to the decision-making components. In more detail, the LEARN-ing mechanism is able to (i) assess decisions that resulted in negative outcomes in the past (e.g., inattention levels remain high or deteriorate after introducing a mini-quiz intervention during a math course) and (ii) incorporate knowledge provided by the teacher (e.g., disambiguation of student behavior, rejection of a suggested intervention during a specific course, etc.).
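As a rough illustration, the four stages could be wired together as in the following Python sketch; all class, method and intervention names here are illustrative assumptions, not LECTOR's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class LectorSketch:
    """Minimal sketch of the SENSE-THINK-ACT-LEARN cycle (hypothetical names)."""
    failed: dict = field(default_factory=dict)  # LEARN: interventions marked ineffective per course

    def sense(self, raw_events):
        # SENSE: collect classroom events from the ambient facilities
        return raw_events

    def think(self, events, context):
        # THINK: classify events as inattentive or not (placeholder rule)
        return [e for e in events if e["kind"] == "inattentive"]

    def act(self, alarms, context):
        # ACT: pick an intervention not previously marked ineffective for this course
        for name in ("mini_quiz", "multimedia_preview", "discussion"):
            if name not in self.failed.get(context["course"], set()):
                return name
        return None

    def learn(self, intervention, context, attention_restored):
        # LEARN: remember interventions that failed to reset attention in this course
        if not attention_restored:
            self.failed.setdefault(context["course"], set()).add(intervention)
```

For instance, after `learn("mini_quiz", {"course": "math"}, attention_restored=False)`, a later `act` call for the math course would skip the mini-quiz and suggest the next candidate.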

#### **4.1 Motivating Scenarios**

**Monitoring the Attention Levels of an Entire Classroom.** On Monday morning the history teacher, Mr. James, enters the classroom and announces that the topic of the day will be the "Battle of Gaugamela". During the first 15 min the students pay attention to the teacher who narrates the story; soon enough, the students start losing interest and demonstrate signs of inattentive behavior. In more detail, John is browsing through the pages of a different book, Mary and Helen are whispering to each other, Peter stares out the window and Mike struggles to keep his eyes open. When identifying that the entire classroom demonstrates signs of inattention, the system recommends that the lecture should be paused and that a mini quiz game should be started. The teacher finishes up his sentence and decides to accept this intervention. After his confirmation, a set of questions relevant to the current topic is displayed on the classroom board, while their difficulty depends on both the students' prior knowledge and the studied material so far. During use, the system identifies the topics with the lowest scores and notifies the teacher to explain them more thoroughly. As soon as the intervention ends, Mr. James resumes the lecture. At this point, the students' attention is reset and they begin to pay attention to the historical facts. As a result, the quiz not only restored their interest, but also resulted in deeper learning.

**Monitoring the Attention Levels of an Individual Student.** During the geography class Kate is distracted by a couple of students standing outside the window. The system recognizes that behavior and takes immediate action to attract her interest back to the lecture. To do so, it displays pictures relevant to the current topic on her personal workstation while a discreet nudge attracts her attention. A picture displaying a dolphin with strange colors swimming in the waters of the Amazon makes her wonder how it is possible for a dolphin to survive in a river; she patiently waits for the teacher to complete his narration to ask questions about that strange creature. That way, Kate becomes motivated and starts paying attention to the presentation of America's rivers. At the same time, Nick is drawing random pictures in his notebook and seems not to be paying attention to the lecture; however, the system already knows that he concentrates more easily when doodling, and decides not to interpret that behavior as inattention.

#### **4.2 Context of Use**

LECTOR's decision-making mechanisms are heavily dependent on contextual information to (i) identify the actual conditions (student status, lecture progress, task at hand, etc.) that prevail in a smart classroom at any given time and (ii) act accordingly. The term context has been used broadly with a variety of meanings for context-aware applications in pervasive computing [9]. The authors in [10] refer to contexts as any information that can be detected through low-level sensor readings; for instance, in a home environment those readings include the room that the inhabitant is in, the objects that the inhabitant interacts with, whether the inhabitant is currently mobile, the time of the day when an activity is being performed, etc.

However, in a smart classroom contextual awareness goes beyond data collected from sensors. Despite the fact that sensorial readings are important for recognizing student activities, they are inadequate to signify inattention without information regarding the nature of the current course, the task at hand, the characteristics of the learner, etc. This work employs the term Physical Context (PC) to indicate data collected from sensors, while the term Virtual Learning Context (VLC) is used for any static or dynamic information regarding the learning process (e.g., student profile, course related information, etc.) [32].
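The PC/VLC split can be pictured as two simple data structures; the concrete fields below are illustrative assumptions rather than LECTOR's actual schema:

```python
from dataclasses import dataclass

@dataclass
class PhysicalContext:
    """PC: values derived from sensor readings (hypothetical fields)."""
    noise_db: float
    gaze_target: str   # e.g. "teacher", "window", "desk"
    posture: str       # e.g. "seated", "standing"

@dataclass
class VirtualLearningContext:
    """VLC: static or dynamic information about the learning process (hypothetical fields)."""
    course: str            # e.g. "music", "math"
    minutes_elapsed: int   # progress of the current lecture
    student_profile: dict  # e.g. {"doodles_to_focus": True}
```

A decision then consumes both halves, e.g. `PhysicalContext(75.0, "window", "seated")` combined with `VirtualLearningContext("geography", 12, {})`, rather than the sensor values alone.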

The exploitation of such contextual information can improve the performance of the THINK component, which employs activity recognition strategies in order to identify student activities and classify them as inattentive or not. Despite the fact that activity recognition mainly relies on sensor readings to detect student activities, the Virtual Learning Context (VLC) is critical to interpret inattention indicators correctly; for example, excess noise generally indicates that students are talking to each other instead of listening to the teacher; however, this is irrelevant during the music class.

Furthermore, VLC is essential for the ACT component; when the system decides to intervene in order to reset students' attention, the selection of the appropriate intervention type depends heavily on the context of use (syllabus covered so far, remaining time, etc.). As an example, if an intervention occurs during the first ten minutes of a lecture, where the main topic has not been thoroughly analyzed by the teacher yet, the system starts a short preview that briefly introduces the lecture's main points using entertaining communication channels (e.g., multimedia content).

#### **4.3 Sensorial Data**

LECTOR is deployed in a "smart classroom" that incorporates infrastructure able to monitor the learners' actions and provide the necessary input to the decision-making components for estimating their attention levels. To ensure scalability, this work is not bound to certain technological solutions; it embraces the fundamental concept of Ambient Intelligence that expects environments to be dynamically formed as devices constantly change their availability. As a consequence, a key requirement is to ensure that new sensors and applications can be seamlessly integrated (i.e., extensibility). In order to do so, LECTOR relies on the AmI-Solertis framework, which provides the necessary functionality for the intercommunication and interoperability of heterogeneous services hosted in the smart classroom.

As regards the supported input sources, they range from simple converters (or even chains of converters) that measure physical quantities and convert them to signals, which can be read by electronic instruments, to software components (e.g., a single module, an application, a suite of applications, etc.) that monitor human computer interaction and data exchange. However, a closer look at the sensorial data reveals that it is not the actual value that matters, but rather the meaning of that value. For instance, the attention recognition mechanism does not need to know that a student has turned his head 23° towards the south, but that he stares out of the window.
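The head-orientation example can be sketched as a small translation function from a raw value to a semantic event; the bearing of the window, the tolerance and the event names are illustrative assumptions:

```python
def head_yaw_to_focus(yaw_deg, window_bearing_deg=90.0, tolerance_deg=20.0):
    """Translate a raw head-yaw reading into a semantic focus event.

    Hypothetical geometry: 0 deg = facing the board/teacher, and the window
    sits at `window_bearing_deg`; both values are illustrative assumptions.
    """
    if abs(yaw_deg - window_bearing_deg) <= tolerance_deg:
        return "stares_out_of_window"
    if abs(yaw_deg) <= tolerance_deg:
        return "faces_teacher"
    return "unknown"
```

The attention recognition mechanism would then consume events such as `"stares_out_of_window"` rather than raw angles.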

Subsequently, LECTOR equips the developers with an authoring tool that enables them to provide the algorithms that translate the raw data into meaningful high-level objects. In more detail, through an intuitive wizard (Fig. 2) the developers (i) define the contextual properties (e.g., Speech, Feelings, Posture, etc.) that will be monitored by the system, (ii) specify the attributes of those properties (e.g., level, rate, duration, etc.) and (iii) develop the code that translates the actual values coming directly from the sensors/applications to those attributes. The in-vitro environment where LECTOR is deployed employs the following ambient facilities:


LECTOR currently uses the aforementioned ambient facilities to monitor some physical characteristics of the students and teachers and translates them, in a context-dependent manner, into specific activities classified under the following categories: Focus, Speech, Location, Posture and Feelings, which are considered appropriate cues that might signify inattention [2, 11, 19, 25].


**Fig. 2.** Snapshot from the developers' authoring tool, displaying the process of defining the 'SOUND' contextual property.

#### **4.4 Inattention Alarms**

LECTOR's THINK component (Fig. 3) is responsible for identifying the students who show signs of inattention. Towards this objective, it constantly monitors their actions in order to detect (sub-)activities that imply distraction and loss of attention. The decision logic that dictates which behaviors signify inattention is expressed via high-level rules in the "Attention rule set", which combines various contextual parameters to define the conditions under which a student is considered distracted. There are two types of rules in the "Attention rule set": (i) rules that denote human activities or sub-activities (e.g., talking, walking, sitting, etc.) and provide input to (ii) rules that signify inattentive behaviors (e.g., disturb, chat, cheat, etc.). Through an educator-friendly authoring tool, namely LECTORstudio [12], teachers have the opportunity to create or modify the latter, while, due to their complexity, they can only fine-tune the rules that denote human (sub-)activities.

Whenever a stimulus is detected by the SENSE component, the THINK component initiates an exploratory process to determine whether the incoming event indicates that the student(s) has lost interest in the learning process or not. In order to do so, it employs the appropriate attention recognition strategies based on the "Attention rule set". Finally, at the end of the exploratory process, if the result points to inattentive behavior, THINK appropriately informs the ACT component, which undertakes to restore student engagement by selecting an appropriate intervention.

Figure 4 presents the graphical representation of a rule describing the activity "SHOUTING", as created in LECTORstudio. Specifically, the purpose of this rule is to create an exception for the Music course, where students sing, thus raising the noise levels of the classroom higher than usual; in that case, the activity "SHOUTING" should be identified when the sound volume captured through the class microphone exceeds the value of 82 dB.
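The "SHOUTING" rule of Fig. 4 could be approximated as follows; only the 82 dB music-class threshold comes from the text, while the default threshold for other courses is an illustrative assumption:

```python
def shouting(sound_db, course, default_threshold_db=70.0, music_threshold_db=82.0):
    """Sketch of the 'SHOUTING' activity rule.

    The music class is an exception: students sing, so the noise level must
    exceed 82 dB (per the paper) before the activity is flagged. The 70 dB
    default for other courses is a hypothetical value.
    """
    threshold = music_threshold_db if course == "music" else default_threshold_db
    return sound_db > threshold
```

A teacher editing this rule in LECTORstudio would effectively be adjusting the course condition or the threshold values.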

**Fig. 3.** LECTOR's THINK component.

**Fig. 4.** A rule describing the activity "SHOUTING", as created in LECTORstudio.

#### **4.5 Intervention Rules**

As soon as inattentive behavior is detected, the ACT component (Fig. 5) initiates an exploratory process to identify the most appropriate course of action. Evidently, selecting a suitable intervention and its proper presentation (appropriate for the device where it will be delivered) is not a straightforward process, as it requires in-depth analysis of both the learners' profile and the contextual information regarding the current course. The first step is to consult the "Intervention rule set", which, similarly to the "Attention rule set", is comprised of high-level rules describing the conditions under which each intervention should be selected (e.g., if all students are distracted during the math course, recommend an interactive task like a mini-quiz) as well as the appropriate means of presentation (e.g., if a mini-quiz is selected and the intervention is intended for all students, display it on the classroom interactive board).

**Fig. 5.** LECTOR's ACT component.

Each intervention rule, upon evaluation, points to a certain intervention strategy in the "Interventions' Pool" (IP). The IP includes high-level descriptions of the available strategies, along with their low-level implementation descriptions. Furthermore, since inattention can originate either from a single student or the entire classroom, the ACT component should be able to evaluate and select strategies targeting either an individual student or a group of students (even the entire class). To this end, the "Interventions' Pool" should contain interventions of both types, and the decision logic should be able to select the most appropriate one. After selecting the appropriate intervention, the system personalizes its content to the targeted student and converts it to a form suitable for the intended presentation device.
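A minimal sketch of this selection step is shown below; the pool entries, the 50% group threshold and the device-mapping policy are illustrative assumptions, not LECTOR's actual rule set:

```python
def select_intervention(pool, distracted_ratio, course, target_device):
    """Pick an intervention from a (hypothetical) Interventions' Pool.

    Scope is chosen per the share of distracted students; the intervention is
    then converted to a form suitable for the presentation device.
    """
    scope = "class" if distracted_ratio >= 0.5 else "individual"
    candidates = [iv for iv in pool
                  if iv["scope"] == scope and course not in iv.get("excluded_courses", ())]
    if not candidates:
        return None
    chosen = candidates[0]  # a real system would rank candidates by context fit
    device = "classroom_board" if scope == "class" else target_device
    return {"name": chosen["name"], "device": device}
```

For example, with most of the class distracted during a math lecture, a class-wide mini-quiz would be routed to the classroom board, while a single distracted student would receive an individual intervention on her own workstation.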

LECTORstudio also permits the teachers to tailor the intervention mechanism to the needs of their course by modifying the "Intervention Rule Set". In more detail, a teacher can create custom interventions, customize existing ones in terms of their content, change the conditions under which an intervention is initiated (e.g., the percentage of distracted students), etc.

#### **4.6 Intervention Assessment**

Both the THINK and ACT components are able to "learn" from previous poor decisions and refine their logic, while they are open to expert suggestions that can override their defaults. In order to introduce the notion of LEARN, LECTOR provides mechanisms that modify the decision-making processes by correlating knowledge gathered through attention monitoring with student performance and expert input.

To this end, the LEARN component is able to correlate the regression of students' attention lapses, through the respective student profile component, with a formerly applied intervention, to identify whether it had positive results or failed to reset attention. In more detail, if the system estimates that a particular intervention will reset attention in the context of a specific course and applies it, then after a reasonable amount of time it re-calculates the current attention levels; if it still detects that the students are not committed to the learning process, then the selected recommendation is marked as ineffective in that context. Hence, the ACT component is informed so as to modify its decision logic accordingly and, from that point forward, select different interventions for that particular course instead of the one that proved unsuccessful.
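The re-assessment step might look as follows; the normalized attention scale and the 0.6 recovery threshold are illustrative assumptions:

```python
def intervention_succeeded(attention_before, attention_after, recovered=0.6):
    """LEARN assessment sketch: attention levels normalized to [0, 1].

    After a reasonable delay, attention is re-measured; the intervention counts
    as successful only if attention both improved and climbed back above the
    (hypothetical) recovery threshold.
    """
    return attention_after >= recovered and attention_after > attention_before

# Mark an intervention as ineffective in its course context when it failed,
# so the ACT component can avoid it there from that point forward.
ineffective = {}
if not intervention_succeeded(0.35, 0.40):
    ineffective.setdefault("math", set()).add("mini_quiz")
```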

On top of the automatic application of active learning interventions, the system also permits additions, modifications, cancellations and ranking of the selected interventions. This allows the teacher to have the final say regarding the lecture format. To this end, the LEARN component takes into consideration the teacher's input and appropriately informs the ACT component so as to refine the intervention rule set and offer more effective alternatives when necessary. In more detail, the teacher should be able to: (i) replace the recommended intervention with a more appropriate one (e.g., quiz, multimedia presentation, discussion, etc.), (ii) rank the recommendation and (iii) abort the intervention in case it disrupts the flow of the course.

### **5 Conclusions and Future Work**

LECTOR provides a framework and an educator-friendly design studio for the smart classroom in order to improve the educational process. For that to be achieved, it equips the environment with a system that is able to monitor the learners' attention levels depending on rules created by the teachers themselves and intervenes, when necessary, to (i) provide a motivating activity to a distracted student or (ii) suggest an alternative pedagogy that would be beneficial for the entire classroom (e.g., by motivating individuals or suggesting different lecture formats, etc.).

Future work includes full-scale evaluation experiments in order to validate the system's efficacy and usability. In particular, two types of user-based experiments will be conducted: (i) experiments assessing the usability of the design studio for the teachers, and (ii) experiments evaluating the system as a whole. These experiments will be conducted for an extended period of time inside the smart classroom environment, where students and teachers will be engaged with several educational activities while the system monitors the learners' attention levels throughout the entire process and intervenes when necessary. The results of this evaluation will be used to identify whether the system can: (a) appropriately adapt its behavior in order to respect teachers' input, and (b) positively affect, through the delivery of personalized interventions, the students' motivation level and overall performance.

### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

### **Predicting Driver's Work Performance in Driving Simulator Based on Physiological Indices**

Cong Chi Tran<sup>1,2(✉)</sup>, Shengyuan Yan<sup>1</sup>, Jean Luc Habiyaremye<sup>1</sup>, and Yingying Wei<sup>3</sup>

<sup>1</sup> Harbin Engineering University, Harbin 150001, China
<sup>2</sup> Vietnam National University of Forestry, Hanoi 10000, Vietnam
trancongchi\_bk@yahoo.com, yanshengyuan@hrbeu.edu.cn, habijealuc@yahoo.fr
<sup>3</sup> East University of Heilongjiang, Harbin 150086, China
weiyingying2007@126.com

**Abstract.** Developing an early warning model based on mental workload (MWL) to predict the driver's performance is critical and helpful, especially for new or less experienced drivers. This study aims to investigate the correlation between a human's MWL and work performance and to develop a predictive model for the driving task using a driving simulator. A performance measure (number of errors), a subjective rating (NASA Task Load Index) and six physiological indices were assessed and measured. Additionally, the group method of data handling (GMDH) was used to establish the work performance model. The results indicate that different complexity levels of the driving task have a significant effect on the driver's performance, and the predictive performance model integrating different physiological measures fits well, with R² = 0.781. The proposed model is expected to provide a reference value for work performance given the physiological indices. Based on this model, driving lesson plans will be proposed to sustain an appropriate MWL as well as improve work performance.

**Keywords:** Driving simulator · Work performance · Predictive model

### **1 Introduction**

Reducing road accidents is an important issue. Contributing factors to crashes are commonly classified as human, vehicle, or roadway and environmental [1]. Driving is often a task with heavy mental workload (MWL), because in order to prevent accidents, drivers must continually acquire and process much information through their eyes, ears, and other sensory organs. This information includes the movements of other vehicles and pedestrians, road signs and traffic signals, and various situations and changes in the road environment. These incidents require a lot of the driver's attention. Human errors such as misperception, information processing errors, and slow decision making are frequently identified as major causes of accidents [2]. Therefore, managing the driver's MWL could be helpful in improving driver performance and reducing the number of accidents.

For most drivers, both excessive and low MWL can degrade their performance and, furthermore, may affect the safety of the driver and others. When the situation is low-demanding (e.g., on long and boring roads), or conversely when the situation is high-demanding (e.g., in the city, with much information to process), the driver's workload departs from the appropriate level, leading to performance impairments [3, 4]. Only with an appropriate level of MWL can drivers perform their tasks well. Therefore, for the purpose of driver safety, developing an early warning model based on MWL to predict the driver's performance is critical and helpful, especially for new or less experienced drivers in driver training.

MWL refers to the portion of operator information processing capacity or resources that is actually required to meet system demands [5]. MWL is induced not only by the cognitive demands of the tasks but also by other factors, such as stress, fatigue and the level of motivation [6, 7]. In many studies of driving tasks, MWL was measured by subjective measures, such as the NASA task load index (NASA-TLX) [8–10]. However, a major limitation of subjective measures is that they can only assess the overall experience of the workload of driving but cannot reflect changes in workload during the execution of the task. Also, rating scale results can be affected by characteristics of the respondents, such as biases, response sets, errors and protest attitudes [11, 12]. Thus, continuous and objective measures (e.g., physiological signals) to assess MWL, in addition to evaluating the overall workload in driving tasks, are necessary [13].

Recently, many driving simulators can measure performance accurately and efficiently, and they are increasingly used in driver education. It is commonly accepted that the use of driving simulators presents some advantages over the traditional methods of learning to drive: because of their virtual nature, the risk of damage due to incompetent driving is null [14]. In addition, simulators make it possible to study hazard anticipation and perception by exposing drivers to dangerous driving tasks, which is an ethically challenging endeavor in real vehicles [15], and they also offer an opportunity to learn from mistakes in a forgiving environment [16, 17]. In this study, we conducted an experiment simulating car driving tasks to assess the relation between work performance, subjective rating, and physiological indices for new drivers. Based on these relationships, the study developed a predictive model by using the group method of data handling (GMDH) to integrate all physiological indices into a synthesized index. The physiological indices used in this study were eye activities (pupil dilation, blink rate, blink duration, fixation duration) and cardiac activities (heart rate, heart rate variability). The performance of the task was measured by the number of errors, and the subjective rating was obtained with the NASA-TLX questionnaire.
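For intuition, GMDH builds successive layers of small polynomial models on pairs of features, keeping only those that generalize best to a validation split. The following is a generic single-layer sketch (not the authors' implementation); stacking layers while the validation error keeps improving yields the full algorithm:

```python
import itertools
import numpy as np

def gmdh_layer(X_tr, y_tr, X_va, y_va, keep=4):
    """One layer of a minimal GMDH sketch.

    For every pair of input features, fit a full quadratic polynomial on the
    training split, score it on the validation split, and keep the `keep`
    best candidate outputs as the next layer's features. Returns the new
    train/validation feature matrices and the best validation RMSE.
    """
    def design(a, b):
        # Ivakhnenko polynomial basis: 1, a, b, a^2, b^2, a*b
        return np.column_stack([np.ones_like(a), a, b, a * a, b * b, a * b])

    candidates = []
    for i, j in itertools.combinations(range(X_tr.shape[1]), 2):
        A_tr = design(X_tr[:, i], X_tr[:, j])
        coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
        pred_va = design(X_va[:, i], X_va[:, j]) @ coef
        rmse = float(np.sqrt(np.mean((pred_va - y_va) ** 2)))
        candidates.append((rmse, A_tr @ coef, pred_va))

    candidates.sort(key=lambda c: c[0])
    best = candidates[:keep]
    new_tr = np.column_stack([c[1] for c in best])
    new_va = np.column_stack([c[2] for c in best])
    return new_tr, new_va, best[0][0]
```

In the study's setting, the input columns would be the six physiological indices and the target the number of driving errors; here the function is shown on synthetic data only.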

### **2 Methodology**

#### **2.1 Participants**

Twenty-six male engineering students, aged 19.2 ± 1.1 years (mean ± SD), voluntarily participated in the experiment. They had very little (less than two months) or no driving experience. They had normal eyesight (normal or corrected-to-normal vision in both eyes) and good health. To ensure the objectivity of the experimental electrocardiography (ECG) data, all participants were asked to refrain from caffeine, alcohol, tobacco, and drugs for six hours before the experiment. All participants completed and signed an informed consent form approved by the university and were compensated with extra credit in extracurricular activities in their course.

#### **2.2 Apparatus**

A driving simulator (Keteng steering wheel and City Car Driving software version 1.4.1) was used in this study. City Car Driving is a car simulator designed to help users experience car driving in a city or in the country under different conditions. Special stress in the City Car Driving simulator has been laid on the variety of road situations and realistic car driving.

An iView X head-mounted eye-tracking device (SensoMotoric Instruments) was used to record participants' dominant eye movements. The software configuration comprised video recording and BeGaze version 3.0 eye movement data analysis, with a sampling rate of 50/60 Hz (optionally 200 Hz), pupil/corneal-reflection tracking resolution <0.1° (typical) and gaze position accuracy of 0.5°–1.0° (typical). An ANSWatch TS0411 was used to measure heart rate (HR) and heart rate variability (HRV) data.

#### **2.3 Work Performance and Mental Workload Measures**

Various MWL measurements have been proposed; they can be divided into three categories: performance measures, physiological measures, and subjective ratings [18]. Performance measures can be classified into categories such as accuracy, task time, worst-case performance, etc. [19]. In this study, the number of driving errors was used for two reasons: (1) driving errors involve risky behaviors that must be understood in order to prevent accidents and fatalities [20], and many studies have shown that the number of errors is sensitive to differences in the visual environment [21, 22]; (2) in the City Car Driving software, all driving errors (such as exceeding the speed limit, running a red light, changing lanes without signaling, or causing an accident) are displayed while driving and counted after the task is finished.

Subjective ratings are designed to collect operators' opinions about the MWL they experience using rating scales. Owing to their low cost, ease of administration, and adaptability, they have been found highly useful in driving tasks [20, 23]. In this study, the NASA-TLX subjective rating [24] was used to evaluate the driver's MWL, because many studies have successfully applied it to measure MWL in driving [8, 9]. The NASA-TLX is a multi-dimensional rating scale that uses six dimensions of workload to provide diagnostic information about the nature and relative contribution of each dimension to overall operator workload: mental demand (MD), physical demand (PD), temporal demand (TD), own performance (OP), effort (EF), and frustration (FR).
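As a concrete illustration of how the six NASA-TLX dimensions combine into an overall workload score, the sketch below computes the standard weighted NASA-TLX score. The ratings (0–100) and pairwise-comparison weights (summing to 15) are hypothetical example values, not data from this study.

```python
# Minimal NASA-TLX scoring sketch. Dimension abbreviations follow the paper;
# the example ratings and weights below are illustrative only.

def nasa_tlx_score(ratings, weights):
    """Weighted NASA-TLX score: sum(rating * weight) / 15."""
    assert set(ratings) == set(weights)
    total_weight = sum(weights.values())
    # The 15 pairwise comparisons between the six dimensions yield
    # weights that sum to 15.
    assert total_weight == 15
    return sum(ratings[d] * weights[d] for d in ratings) / total_weight

# Hypothetical ratings for one participant after one task phase.
ratings = {"MD": 70, "PD": 40, "TD": 60, "OP": 50, "EF": 65, "FR": 30}
weights = {"MD": 4, "PD": 1, "TD": 3, "OP": 2, "EF": 3, "FR": 2}
score = nasa_tlx_score(ratings, weights)
```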

Physiological measures can be further divided into central nervous system measures and peripheral nervous system measures [25]. These methods do not require the user to generate overt responses, they allow a direct and continuous measurement of the current workload level, and their high temporal sensitivity lets them detect short periods of elevated workload [26]. Although central nervous system measures (i.e., the electroencephalogram) are highly reliable for measuring a driver's MWL [13], their applicability is limited by the expensive instruments required, which did not suit the conditions of this experiment. Therefore, central nervous system measures were not used in this study.

Eye activity capture is a technique that records eye behavior in response to a visual stimulus, and it has become a widely used method for analyzing human behavior [27]. Eye response components that have been used as MWL measures include pupil dilation, blink rate, blink duration, and fixations. Pupil dilation may be used as a measure of psychological load because it is related to the amount of cognitive control, attention, and cognitive processing required for a given task [28]. It has also been shown to correlate with cognitive workload, whereby an increased frequency of dilation is associated with an increased degree of task difficulty [29]. In driving studies, pupil dilation was able to reflect the load required by tasks [30] and to measure the average arousal underlying cognitive tasks [31]. The eye blink, the rapid closing and reopening of the eyelid, is believed to be an indicator of both fatigue and workload; it is well known that eye blink rate is a good indicator of fatigue. Blink rate has been investigated in a series of driver and workload studies with mixed results, attributable to the distinction between mental and visual workload [31]: blink rate is affected by both MWL and visual demand, which act in opposition to each other, the former increasing blink rate and the latter decreasing it. Besides blink rate, blink duration has been shown to be affected by visual task demand and to decrease as MWL increases; the studies reviewed by Kramer all found shorter blink durations for increasing task demands, both mental and visual [32]. Some studies show that blink duration is a sensitive and reliable indicator of driver visual workload [8, 33]. Eye fixation duration is also an extensively used measure and is believed to increase with increasing mental task demands [34]. Recently, fixation duration and the number of fixations have also been investigated in a series of studies on driver hazard perception, which found increased fixation durations during hazardous moments, indicating increased MWL [20].

Heart rate (HR) and heart rate variability (HRV) potentially offer objective, continuous, and nonintrusive measures of a human operator's MWL [26]. Numerous studies show that HR reflects the interaction of low MWL and fatigue during driving [35, 36]. In addition to basic HR, there has been growing interest in various measures of HRV. Spectral analysis of HRV enables investigators to decompose it into components associated with different biological mechanisms, such as the sympathetic/parasympathetic ratio or low-frequency to high-frequency power (LF/HF) ratio, the mean inter-beat interval (RR), the standard deviation of normal RR intervals (SDNN), etc. The SDNN reflects the level of sympathetic activity relative to parasympathetic activity and has been found to increase with the level of MWL [13, 25].
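To make the cardiac indices concrete, the sketch below derives mean HR, mean RR, and SDNN from a series of RR (inter-beat) intervals; the interval values are hypothetical, chosen only to illustrate the computation.

```python
# Deriving mean HR (bpm), mean RR (ms) and SDNN (ms) from RR intervals.
# The RR values below are illustrative, not measured data.
import statistics

def hrv_measures(rr_ms):
    """Return (mean HR in bpm, mean RR in ms, SDNN in ms)."""
    mean_rr = statistics.mean(rr_ms)       # mean inter-beat interval
    hr = 60000.0 / mean_rr                 # 60 000 ms per minute / mean RR
    sdnn = statistics.stdev(rr_ms)         # standard deviation of NN intervals
    return hr, mean_rr, sdnn

rr = [812, 790, 805, 822, 798, 810, 795, 830]   # hypothetical RR series (ms)
hr, mean_rr, sdnn = hrv_measures(rr)
```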

#### **2.4 Experimental Task**

There were three levels of task complexity in this experiment: high, medium, and low. The task condition settings are shown in Table 1.


**Table 1.** Experiment task setting

#### **2.5 Group Method of Data Handling**

Many methods can be used to develop predictive models, such as GMDH, neural networks, logistic regression, naive Bayes, etc. This study used the GMDH method [37] to establish a prediction model of work performance. GMDH is a widely used neural network methodology that requires no assumptions about the relationship between predictors and responses [38]. The GMDH algorithm has been applied in various fields, e.g., nuclear power plants [25], Stirling engine design [39], and education [40]. This study investigated the relationship between six physiological indices and work performance at different levels of task complexity.
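The core GMDH idea can be sketched as follows: quadratic partial models are fitted to every pair of inputs and ranked by an external (validation) criterion, with the best candidates surviving to the next layer. The minimal one-layer sketch below is an illustration of that principle under our own assumptions, not the DTREG implementation used in this study.

```python
# Minimal one-layer GMDH sketch: fit an Ivakhnenko quadratic partial model
# for every pair of inputs and keep the pair with the lowest validation MSE.
import itertools
import numpy as np

def partial_model_features(xi, xj):
    # Ivakhnenko polynomial terms: 1, xi, xj, xi*xj, xi^2, xj^2
    return np.column_stack([np.ones_like(xi), xi, xj, xi * xj, xi**2, xj**2])

def gmdh_layer(X_train, y_train, X_val, y_val):
    """Return (mse, (i, j), coefficients) of the best input pair,
    judged by mean squared error on the external validation set."""
    best = None
    for i, j in itertools.combinations(range(X_train.shape[1]), 2):
        A = partial_model_features(X_train[:, i], X_train[:, j])
        coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
        pred = partial_model_features(X_val[:, i], X_val[:, j]) @ coef
        mse = float(np.mean((y_val - pred) ** 2))
        if best is None or mse < best[0]:
            best = (mse, (i, j), coef)
    return best

# Synthetic demonstration: the response depends only on inputs 0 and 1,
# so the selection criterion should pick the pair (0, 1).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = 1.0 + 2.0 * X[:, 0] + X[:, 1] ** 2
mse, pair, coef = gmdh_layer(X[:40], y[:40], X[40:], y[40:])
```

A full GMDH network would repeat this layer, feeding the surviving partial-model outputs back in as new inputs until the external criterion stops improving.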

#### **2.6 Procedure**

All participants received about two hours of training, during which they were taught how to use the eye-tracking equipment, how to complete the NASA-TLX questionnaire, and how to operate the driving simulator. After that, each participant received about 30 min to practice alone on the driving simulator. This practice served to familiarize subjects with the simulator and the general feel of the pedals and steering, and it ended only when the participant was sure that he understood all procedures. The experiment was conducted on the next day.

Before the experiment, the participant took a 20-min rest, then put on the measurement apparatus and the system was adjusted. Initial physiological indices were acquired as a baseline before the experiment. During the experiment, the physiological indices were collected during each phase (level of task complexity), and the NASA-TLX questionnaire was administered after each phase to evaluate the subjective MWL at the different levels of task complexity. Each phase lasted about 20 min, with a 5-min break after each phase. The driving speed in this study was limited to less than 45 km/h.

The scenario included a normal driving environment in the city (2 km of city roads with some stop signs or traffic lights). Each participant performed the three task levels in a randomized order (Fig. 1). They were asked to follow speed limits and to comply with traffic laws throughout the experiment. The three workload levels (high, medium, and low task complexity) used in this experiment are shown in Table 1.

**Fig. 1.** Driving task in the experiment: (A) Low task; (B) Medium task; (C) High task

### **3 Results**

#### **3.1 Sensitivity with the Workload Level**

At an alpha level of .05, the MANOVA results showed a statistically significant difference between task levels, *F*(16, 136) = 3.52, *p* < .0005; Wilks' *Λ* = .50, partial *η*<sup>2</sup> = .293, with a high observed power of 99.1%. Descriptive statistics are presented in Table 2. There were significant differences between workload levels for most measures in this driving task; however, no significant difference was found in pupil dilation (*p* = .574) or fixation duration (*p* = .143). The number of errors showed that the high task produced significantly more errors than the low task, by almost 23.3% (Tukey HSD *p* = .036). However, there was no significant difference between the high and medium tasks (Tukey HSD *p* = .261) or between the medium and low tasks (Tukey HSD *p* = .561).


**Table 2.** Sensitivity with the workload level

\* p ≤ .05, \*\* p ≤ .001

#### **3.2 Correlation Between the Number of Errors and Other Methods**

Correlation analysis was used to examine the relationship between the number of errors and the other methods, as shown in Table 3. The number of errors and the NASA-TLX score were positively correlated; the correlation coefficient of *r* = 0.563 was statistically significant at *p* < 0.01 (two-tailed). The mean NASA-TLX score and number of errors of each participant are shown in Fig. 2.


**Table 3.** Correlation between the number of errors and other methods

\*\* Correlation is significant at the 0.01 level (2-tailed).

\* Correlation is significant at the 0.05 level (2-tailed).

The statistics also showed that most physiological measures in this study correlate significantly with the number of errors, indicating that physiological measures may be used to assess participants' work performance in complex driving tasks.

**Fig. 2.** Mean of the number of errors and NASA-TLX score of each participant in driving task

#### **3.3 Predicting the Number of Errors by Integrating Physiological Measures**

Six physiological indices, namely pupil dilation (*X1*), blink rate (*X2*), blink duration (*X3*), fixation duration (*X4*), HR (*X5*), and SDNN (*X6*), were integrated into a synthesized index to establish a model of work performance using the GMDH method and the predictive modeling software DTREG version 10.6. The training-to-testing ratio was set to 80%:20% to suit the available sample size of 26. Each input variable (*Xi*) was normalized to the range [0, 1] before the training and testing process began. The network was trained on a randomly selected training data set, and the training data were never reused as test data.
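A minimal sketch of this preprocessing, assuming simple min-max scaling to [0, 1] and a random 80/20 participant split; the sample values are illustrative only.

```python
# Min-max normalization of an index and an 80/20 split of the 26 participants.
# The numeric values and the fixed seed are illustrative assumptions.
import random

def min_max(values):
    """Scale a list of values to the range [0, 1] (assumes non-constant input)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

random.seed(42)
indices = list(range(26))
random.shuffle(indices)
train_idx, test_idx = indices[:21], indices[21:]   # ~80% / 20% of 26

scaled = min_max([62.0, 75.5, 88.0, 70.2])         # e.g. one HR index per phase
```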

The results indicated that the physiological indices X1, X2, and X5 were the most significant predictors of participants' performance. The model is expressed by Eq. (1); its mean square error was 1.03, and the R2 of the model was 78.1%.

$$
\begin{aligned}
Y = {} & 4.816 + 1.152X_5 + 0.588X_2 + 0.233X_1 - 0.477X_5^2 + 0.091X_2^2 + 0.095X_1^2 \\
& - 0.433X_5X_2 - 0.290X_5X_1 - 0.163X_2X_1 + 1.125X_5X_2X_1 - 0.451X_5^3 - 0.276X_2^3 \\
& + 0.027X_1^3 - 0.467X_5X_2^2 - 0.309X_5X_1^2 - 0.844X_2X_5^2 + 0.010X_2X_1^2 \\
& + 1.079X_1X_5^2 + 0.217X_1X_2^2
\end{aligned}\tag{1}
$$
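For reference, Eq. (1) can be transcribed directly into code and evaluated once the three inputs have been normalized to [0, 1]; this is a plain transcription of the published coefficients, not a re-derivation.

```python
# Direct transcription of the GMDH model of Eq. (1). Inputs X1 (pupil
# dilation), X2 (blink rate) and X5 (HR) must be min-max normalized to [0, 1].
def predicted_errors(X1, X2, X5):
    return (4.816 + 1.152*X5 + 0.588*X2 + 0.233*X1
            - 0.477*X5**2 + 0.091*X2**2 + 0.095*X1**2
            - 0.433*X5*X2 - 0.290*X5*X1 - 0.163*X2*X1
            + 1.125*X5*X2*X1
            - 0.451*X5**3 - 0.276*X2**3 + 0.027*X1**3
            - 0.467*X5*X2**2 - 0.309*X5*X1**2
            - 0.844*X2*X5**2 + 0.010*X2*X1**2
            + 1.079*X1*X5**2 + 0.217*X1*X2**2)

y0 = predicted_errors(0.0, 0.0, 0.0)   # all inputs at their minima: intercept
```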
For the validation data, the mean of the predicted values was 4.62, while the mean target value of the input data was 4.5 (97.4% agreement). Therefore, this model is suitable for estimating performance under different MWL levels based on physiological measures in driving tasks.

### **4 Discussion**

The number of errors was used as the performance measure for the driving tasks in this study. The results showed that increasing task complexity increased the number of errors. This is consistent with numerous studies which found that human performance is affected when the MWL is high [41]. On the other hand, the NASA-TLX scores showed a significant correlation with the different levels of MWL. For most subjects, the highest NASA-TLX score occurred in the high task complexity phase, whereas the lowest occurred in the low task complexity phase. This indicates that the tasks used in this experiment could distinguish between different levels of MWL.

Eye response measures are useful for reflecting the temporal distribution of workload levels in a driving task. However, no significant difference was found in pupil dilation or fixation duration. This result indicates that pupil dilation in this experiment might not represent an increased processing need but rather an increased attention and arousal caused by errors. This finding is consistent with Bradshaw's study, in which pupil size change was linked not to task complexity but to the participants' level of arousal in problem-solving tasks [42]. Fixation duration is an extensively used measure and is believed to increase with increasing mental task demands [34], and Goldberg and Kotval [43] also found a negative correlation between fixation time and performance. Although no overall significant difference in fixation duration was found between task levels, there was a significant difference between the high task and the low task. This may be explained by the small differences between the task levels (low-medium-high).

Cardiac responses such as HR and HRV were used, and these responses seem more sensitive to accumulated workload than eye response measures. The experimental results indicated that the participants' mean HR and HRV components increased as task complexity increased. These findings are consistent with previous studies [13, 44]. Participants in the driving task needed to continuously exert mental effort to stay alert, and fatigue may have reduced their attention. O'Hanlon [45] found that an initial decrease in HRV changed into a gradual increase during long continuous driving, and Tripathi, Mukundan and Mathew [46] also found that HRV increased in high-demand vigilance tasks that require continuous exertion. Another plausible reason is the interaction of respiration with HR and HRV: a cognitive load increases cells' oxygen demand and leads to greater cardiac output through increased HR [47]. During task execution, participants breathed deeply and slowly, which would tend to increase HRV.

Finally, this study used the GMDH method to construct a model to predict drivers' work performance at different workload levels. Although the statistics in Table 3 showed that blink rate and the HRV measure did not correlate significantly with the number of errors at the .05 level, the predictive model that integrates the different physiological measures explains 78.1% of the variance in the number of errors. This model could thus provide a reliable reference tool for predicting drivers' work performance.

#### **4.1 Limitations**

Some limitations of this study should be mentioned. First, the experiment used a small sample drawn from a student population to evaluate the predictive model; the small sample size reduced the statistical power, and these students do not represent the characteristics of the broader population of learner drivers. In addition, under simulation conditions participants often feel psychologically comfortable because they do not suffer the consequences of their mistakes when an operation fails or the task requirements are not fulfilled; this may reduce the significance of differences among the outcomes and limits the reliability of the assessment results. Finally, the results do not show a causal relationship between the physiological measures and the error rate, but only a correlation between them under certain conditions.

### **5 Conclusions**

This paper reports on the correlation between MWL and work performance in a simulated driving task, based on the NASA-TLX and six physiological indices. The results show that the different complexity levels of the driving task have a significant effect on new drivers' performance. Of the six physiological indices used, three (pupil dilation, blink rate, and HR) were significant predictors, and the model fit was very good, with R2 = 0.78. Therefore, this model can be used to predict new drivers' work performance and may be applicable in practice. Although the model development process is still at an early phase, the model can be used to predict the performance of new or inexperienced drivers during the practice phase of driver training.

**Acknowledgements.** The authors would like to thank the reviewers for their valuable remarks and comments. Also, the authors thank the participants who helped conduct this research.

### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

### Machine Perception of Humans

### **Exploring the Dynamics of Relationships Between Expressed and Experienced Emotions**

Ramya Srinivasan(B), Ajay Chander, and Cathrine L. Dam

Fujitsu Laboratories of America, Sunnyvale, CA, USA ramya@us.fujitsu.com

**Abstract.** Conversational user interfaces (CUIs) are rapidly evolving towards being ubiquitous as human-machine interfaces. Often, CUI backends are powered by a combination of human and machine intelligence, to address queries efficiently. Depending on the type of conversation issue, human-to-human conversations in CUIs (i.e. a human end-user conversing with the human in the CUI backend) could involve varying amounts of emotional content. While some of these emotions could be expressed through the conversation, others are experienced internally within the individual. Understanding the relationship between these two emotion modalities in the end-user could help to analyze and address the conversation issue better. Towards this, we propose an emotion analytic metric that can estimate experienced emotions based on its knowledge about expressed emotions in a user. Our findings point to the possibility of augmenting CUIs with an algorithmically guided emotional sense, which would help in having more effective conversations with end-users.

**Keywords:** Conversational user interfaces · Expressed and experienced emotions

### **1 Introduction**

Conversational user interfaces (CUIs) are interactive user interfaces that allow users to express themselves conversationally, and are often powered by a combination of humans and machines at the back end [1]. Across a wide range of applications, from assisting with voice-command texting while driving to sending alerts when household consumables need to be ordered, CUIs have become a part of our everyday lives. In particular, bots within messaging platforms have witnessed rapid consumer proliferation. These platforms cater to a wide spectrum of human queries and messages, both domain-specific as well as general purpose [2].

Depending on the type of issue being discussed, human-to-human conversations in CUIs (i.e., conversations between human in the CUI backend and the human end-user) could involve varying amounts of emotional content. While some of these emotions could be expressed in the conversation, others are felt or experienced internally within the individual [12]. An expressed emotion need not always correspond to what the user is actually experiencing internally. For example, one can suppress the internal feelings and express some other emotion in order to be consistent with certain socio-cultural norms [23]. Since experienced emotions are felt internally, they may not be easily perceived by others.

Understanding the relationship between expressed and experienced emotions could facilitate better communication between the end-user and the human in the CUI backend [7]. Analyzing experienced emotions could also help uncover certain aspects of an individual that need attention and care. For example, a feeling of extreme sadness within an individual could be expressed externally as anger [6]. Employing this type of emotion metric could enhance both the scope and usage of CUIs. In this paper, we propose such an emotion metric by developing a machine learning method to estimate the probabilities of experienced emotions based on the expressed emotions of a user.

*Problem Setting:* We consider the scenario of textual conversations involving individuals needing emotional support. For convenience, we refer to individuals needing support as users. On the other end of the conversation platform are the human listeners (typically counselors). The human listener chats directly with the user using a text-only interface and our algorithm (i.e. the machine) analyzes the texts of the end-user. The machine provides a quantitative assessment of the experienced emotions in the user's text. All assessments are specific to the user under consideration.

The machine first evaluates the conditional probability of experiencing an emotion *emo<sup>n</sup>* internally given that an emotion *emo<sup>m</sup>* is explicitly expressed. In the rest of this paper we represent this conditional probability as *Pt*(*emon|emom*). For example, the probability of experiencing sadness internally given that anger has been expressed, is represented as *Pt*(*sad|angry*). From these conditional probabilities, the probabilities of various experienced emotions (*P*(*emon*)) are obtained. A detailed explanation of the procedure is described in Sect. 3.

### **2 Related Work**

CUIs are used for a variety of applications. For example, IBM's Watson technology has been used to create a teaching assistant for a course taught at Georgia Tech [13], Google chatbot, "Danielle", can act like book characters [14], and so on. There are also emotion-based models for chatbots such as [25], wherein the authors propose to model the emotions of a conversational agent.

A summary of affect computing measures is provided in D'Mello et al. [16]. Mower et al. [27] propose an emotion classification paradigm based on emotion profiles. There have been efforts to make machines social and emotionally aware [23]. There are methods to understand sentiments in human-computer dialogues [18], in naturalistic user behavior [24] and even in handwriting [26]. However, we are not aware of any work that estimates the underlying, experienced emotions in text conversations.

Bayesian theory has been used to understand many kinds of relationships in domains such as computer vision, natural language processing, economics, medicine, etc. For example, Ba et al. [15] use Bayesian methods for head and pose estimation. Dang et al. [17] leverage Bayesian framework for metaphor identification. Bayesian inference has been used in recent years to develop algorithms for identifying e-mail spam [28]. More recently, Microsoft Research created a Bayesian network with the goal of accurately modeling the relative skill of players in head-to-head competitions [29]. Our work describes a new application of Bayesian theory, namely, to estimate experienced emotions in text conversations.

### **3 Method**

Let the conditional probability of experiencing an emotion *emo<sup>n</sup>* given that an emotion *emo<sup>m</sup>* is expressed be denoted by *Pt*(*emon|emom*). We evaluate *Pt*(*emon|emom*) using a Bayesian framework. These are then normalized over the space of all expressed emotions *emo<sup>m</sup>* to obtain the probabilities of various experienced emotions *emon*.

First, an emotion recognition algorithm is run on the end-user's texts to determine the probabilities of various expressed emotions. These probabilities serve as priors in the Bayesian framework. Next, we leverage large datasets containing emotional content across many people (such as blogs, etc.) to measure the similarities between words corresponding to a pair of emotions. This information is computed across several people and is reflective of the general relatedness between two emotion-indicating words (for example, between the words "sad" and "angry"). This measure is then normalized (across all possible pairs of emotions considered) to constitute the likelihood probability in the Bayesian framework. The priors and likelihoods are then integrated to obtain *Pt*(*emon|emom*). This conditional probability is specific to the end-user under consideration. This is then normalized over all possible choices of expressed emotions to obtain probabilities of experienced emotions for the end-user under consideration.

While a variety of other approaches could be used for this computation, our choice of the Bayesian framework is motivated by the following facts. First, Bayesian models have been successful in characterizing several aspects of human cognition such as inductive learning, causal inference, language processing, social cognition, reasoning and perception [30]. Second, Bayesian learning incorporates the notion of prior knowledge which is a crucial element in human learning. Finally, these models have been successful in learning from limited data, akin to human inference [31].

#### **3.1 Estimation of Priors**

During the course of the user's conversation with a human listener, we perform text analysis at regular time instances to get probabilities of different emotions. These probabilities are determined based on the occurrences of words representative of emotions in user's text. In our setting, we measure the probability of the following emotions—happy, sad, angry, scared, surprised, worried, and troubled. We arrived at these seven emotions by augmenting commonly observed emotions in counseling platforms with those that are widely accepted in psychological research [32]. These probabilities provide some "prior" information about the user's emotions and hence serve as the priors in the Bayesian framework.

Let the prior probability of an emotion *i*, be denoted by *Pp*(*emoi*). Thus, there are multiple emotion variables, with each of these variables taking a value in the range [0*,* 1] indicating their probabilities. We leverage word synsets to obtain a rich set of words related to each of the emotions that we want to recognize. Synsets are defined as a set of synonyms for a word. Let the set of synsets across all the emotion categories be referred to as the emotion vocabulary. The words in a user's text are then matched for co-occurrence with the emotion vocabulary and are weighted (normalized) based on their frequency of occurrence to obtain probability of an emotion. We found this simple approach quite reliable for our data. This will give the probabilities for various expressed emotions.

#### **3.2 Estimation of Likelihoods**

We estimate similarities between words corresponding to a pair of emotions by training neural word embeddings on large datasets [9]. This similarity gives a measure of relatedness between two emotion-indicating words in a general sense. For example, if the word "sad" has higher similarity with word "anger" than with the word "worry", then we assume that the relatedness between emotions "sad" and "anger" is higher than the relatedness between "sad" and "worry". This may not necessarily be true with respect to every user, but is true in an average sense since the calculation is based on very large datasets of emotional content across several people. Since this measure is data-dependent, we have to choose appropriate datasets containing significant emotional content to get reliable estimates. We then normalize the similarity scores to obtain the likelihood probability. The details are as follows:

Specifically, we train a skip-gram model on a large corpus of news articles (over a million words), blogs and conversations that contain information pertaining to people's emotions, behavior, reactions and opinions. As a result, the model can provide an estimate of relatedness *remoi*−*emo<sup>j</sup>* between two emotions (*emo<sup>i</sup>* and *emo<sup>j</sup>* ) leveraging information across a wide set of people and contexts. This quantity is just capturing the relatedness between any two emotions in a general sense, and is not specific to a particular user. We compute likelihood probability of observing emotions *emo<sup>j</sup>* given *emoi*, *Pl*(*emo<sup>j</sup> |emoi*) based on normalizing the similarities *remoi*−*emo<sup>j</sup>* over the space of all possible emotions under consideration. Thus,

$$P\_l(emo\_j|emo\_i) = \frac{r\_{emo\_i - emo\_j}}{\sum\_{all \quad emo} r\_{emo\_i - emo\_j}}\tag{1}$$

The emotion pairs considered in Eq. (1) do not necessarily represent expressed or experienced emotions; the likelihood probability is just a measure of relatedness between a pair of emotions computed from large datasets.
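A small sketch of Eq. (1): the similarity table below is a toy stand-in for trained skip-gram similarities, used only to show the normalization step.

```python
# Likelihood sketch for Eq. (1): normalize relatedness scores r(emo_i, emo_j)
# over all emotions under consideration. The similarity values are toy
# numbers, not trained word-embedding similarities.
SIMILARITY = {
    "sad":   {"sad": 1.00, "angry": 0.55, "worried": 0.35},
    "angry": {"angry": 1.00, "sad": 0.55, "worried": 0.25},
}

def likelihood(emo_i, emo_j):
    """P_l(emo_j | emo_i) = r(emo_i, emo_j) / sum over all emotions."""
    row = SIMILARITY[emo_i]
    return row[emo_j] / sum(row.values())

p = likelihood("sad", "angry")   # 0.55 / (1.00 + 0.55 + 0.35)
```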

#### **3.3 Estimating Conditional Probabilities**

We employ a Bayesian framework to integrate emotion priors with the likelihood probabilities. Let *Pp*(*emon*) be the prior probability of an emotion *emo<sup>n</sup>* as obtained from an emotion analysis algorithm, and *Pl*(*emom|emon*), be the likelihood probability of *emo<sup>m</sup>* given *emon*, obtained by using appropriate training dataset. Then, the posterior probability of experiencing an *emo<sup>n</sup>* given an expressed emotion *emo<sup>m</sup>* is given by

$$P\_t(emo\_n|emo\_m) = \frac{P\_l(emo\_m|emo\_n)P\_p(emo\_n)}{\sum\_{all \quad emo} P\_l(emo\_m|emo\_n)P\_p(emo\_n)}\tag{2}$$

The above quantity is specific to the user under consideration.

#### **3.4 Estimating Probabilities of Experienced Emotions**

The conditional probabilities computed from Eq. (2) are specific to a user. By normalizing these conditional probabilities across all possible choices of expressed emotions, we obtain the probabilities of various experienced emotions. Specifically,

$$P(emo\_b) = \sum\_a P\_t(emo\_b|emo\_a)P\_p(emo\_a) \tag{3}$$

where *emo<sup>a</sup>* is an expressed emotion and *emo<sup>b</sup>* is an experienced emotion. The sets of expressed and experienced emotions need not be mutually exclusive.
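Eqs. (2) and (3) can be sketched together as follows; the priors and likelihoods below are illustrative numbers chosen for the example, not values estimated from the dataset.

```python
# Bayesian combination sketch for Eqs. (2) and (3).
# lik[m][n] stands for P_l(emo_m | emo_n): likelihood of expressing emo_m
# given that emo_n is experienced. All numeric values are illustrative.
def posterior(emo_n, emo_m, priors, lik):
    """P_t(emo_n | emo_m) from Eq. (2)."""
    denom = sum(lik[emo_m][k] * priors[k] for k in priors)
    return lik[emo_m][emo_n] * priors[emo_n] / denom

def experienced(emo_b, priors, lik):
    """P(emo_b) from Eq. (3): marginalize over expressed emotions emo_a."""
    return sum(posterior(emo_b, emo_a, priors, lik) * priors[emo_a]
               for emo_a in priors)

priors = {"sad": 0.6, "angry": 0.4}               # P_p from the text analysis
lik = {"angry": {"sad": 0.4, "angry": 0.6},        # P_l(expressed|experienced)
       "sad":   {"sad": 0.7, "angry": 0.3}}
p_sad = experienced("sad", priors, lik)
```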

#### **3.5 Dataset**

We studied the performance of the algorithm on a dataset consisting of 16 anonymous user conversations with a human listener spanning a total of more than 20 h. Conversations between users and human listener dealt with a variety of topics such as relationship issues, emotional wellbeing, friendship problems, etc. On average, the conversation between a user and the human listener lasted approximately 30 min. Some of these conversations lasted more than an hour (the longest was 70 min) while some lasted only 10 min. We divided the conversations into segments corresponding to the time a user spoke uninterrupted by a human listener. For convenience we refer to each segment as a "transcript". Transcripts numbered A.x are all contiguous parts of the same conversation A. There were over fifty transcripts in the dataset.

### **4 Results**

We illustrate the performance of the proposed method on some user conversations. Users converse with a human listener, henceforth abbreviated "HL". All results pertain only to the user's part of the conversation and only to the specific time interval considered. The identities of the users and the human listener were anonymized by the conversation platform. Note that an experienced emotion could become expressed at a later time, so the sets of expressed and experienced emotions are not mutually exclusive. Also, the algorithm can compute probabilities of experienced emotions only for those emotions for which a prior is available.

#### **4.1 Case Studies**

#### **Transcript 1.1 (0th–10th min)**

user: Hi, can you please help me with anxiety.


user: Can you help?


Tables 1 and 2 list the expressed and experienced emotions during the first 10 min of the conversation.

**Table 1.** Expressed emotions for transcript 1.1: 0*th*–10*th* min




#### **Transcript 1.2 (10th–20th min)**

user: I Dont know why I am insecure with her, I just feel inadequate.

hl: You feel insecure and inadequate with her. Have you felt like this with other girlfriends?

user: Once before but not as bad. She is beautiful.


hl: So you think she is beautiful but you're not sure how she feels about you?

user: I Dont know, I think I might be over eager and care for her too much.

Tables 3 and 4 list the results of the algorithm.

**Table 3.** Expressed emotions for transcript 1.2: 10*th*–20*th* min


**Table 4.** Experienced emotions for transcript 1.2: 10*th*–20*th* min


Results are listed in Tables 5 and 6. Similar analysis was carried out throughout the conversation. The following is the last transcript of this conversation.

**Table 5.** Expressed emotions for transcript 1.3: 20*th*–30*th* min


**Table 6.** Experienced emotions for transcript 1.3: 20*th*–30*th* min


#### **Transcript 1.4 (40th–50th min)**

user: I Dont have anyone I can confide in.


hl: yeah, it sounds like you feel really open but also very vulnerable because of everything you've shared. that's hard.

user: I'm very vulnerable. Should I go to the doctor?

hl: I'm not sure. If you're thinking about it, it might be a good idea. What kind of advice are you looking for from them?

user: I Dont know, maybe medication

hl: Ah, I see what you're saying. Medication can help a lot with anxiety for sure. It sounds like you're feeling really bad and anxious and really don't want to feel like this anymore. I think it's always good to find out if a doctor can help with something like that. . .

Tables 7 and 8 provide the assessment of expressed/experienced emotions.

**Table 7.** Expressed emotions for transcript 1.4: 40*th*–50*th* min


**Table 8.** Experienced emotions for transcript 1.4: 40*th*–50*th* min


We present another case study. For brevity, we omit the conversation excerpts of HL (the machine analyzes only user texts) and show results for the first part of the conversation. A similar analysis was carried out for the rest of the conversation.

#### **Transcript 2.1 (0th–10th min)**

user: okay, so I am 18 and my boyfriend is 17. He has BAD anger, it's never been anything physical. but he always gets mad over the littlest things and he always acts like everything bothers him when I say something wrong. . . but when he does something like that I am supposed to take it as a joke. and then he gets mad and tries to blow it off when I say something as a joke like "yep." "yeah." "nope I am fine." and acts short (Tables 9 and 10).



**Table 10.** Experienced emotion for transcript 2.1: 0*th*–10*th* min


#### **Transcript 2.2 (10***th***–20***th* **min)**

user: yeah i just need help getting through that it. yeah. . . and i'm worried with me going to college it'll get worse. I guess. . . it's just hard, not only that but my mom is freaking out on me and mad at me. all the time when i haven't done ANYTHING and that is really stressing me out. . . i don't know. . . i really don't she makes me feel lie i am a failure because i don have a job or anything and it doesn't help that she going through a change because she is 50. . . her and my little brother and stepfather constantly gang up on me. my brother is the worst. My boyfriend says i should leave since i am 18 but i have no where to go because i do not have a job nor any money (Tables 11 and 12).

#### *Machine Observations*

**Table 11.** Expressed emotions for transcript 2.2: 10*th*–20*th* min


**Table 12.** Experienced emotion for transcript 2.2: 10*th*–20*th* min


#### **4.2 Analysis**

*Validation with Human Experts:* To investigate the effectiveness of the algorithm, we asked human experts to state the top three emotions the user in a given transcript was experiencing. The experts were chosen based on their knowledge and experience in the psychology of active listening. They were not restricted to the set of emotions the machine could identify; instead, they were free to mention any emotion they found appropriate. To compare with the machine's output, we mapped similar emotion-describing words into the same category; for example, "anxious" was mapped to "worried". In 75% of the transcripts, the top emotion chosen by the evaluators matched the top experienced emotion computed by the machine. In the absence of ground truth (i.e., we did not have information from the user as to what they were actually experiencing), this accuracy is reasonable.

Note that with more information about the user (such as their conversation history), the machine would be able to uncover more hidden emotions. Also, given that the human evaluation itself was subjective, the machine's result can serve as an additional source of information. For example, for the user in Transcript 2, the machine suggested that sadness was the highest experienced emotion. Interestingly, none of the human experts identified sadness among the top three experienced emotions. However, given the situation of the user, it is not unreasonable to say that sadness likely underlies all her other emotions.

*Understanding the User:* One of our goals in this study was to understand the patterns of expressed and experienced emotions in users. Figure 1 plots the highest expressed and experienced emotions at every time interval for the user in Transcript 1. Throughout, the expressed emotion is consistent with the experienced emotion, and there is no statistically significant difference between the degrees of expressed and experienced emotions. Figure 2 plots the lowest expressed and experienced emotions. Except for one time interval (the last, wherein the lowest expressed emotion is worried and the lowest experienced emotion is fear), the lowest expressed and experienced emotions are the same, with no statistically significant difference in their intensity.

**Fig. 1.** Highest expressed and experienced emotions for user in Transcript 1.

**Fig. 2.** Lowest expressed and experienced emotions for user in Transcript 1.

Thus, this user is *mostly* expressing what s/he is experiencing. As another case study, consider the user in transcript 2. Figure 3 summarizes the *highest* expressed and experienced emotions for this user. Figure 4 shows the plot for *lowest* expressed and experienced emotions for this user. As can be noticed from Figs. 3 and 4, this user is *always* expressing what she is experiencing.

**Fig. 3.** Highest expressed and experienced emotions for user in Transcript 2.

**Fig. 4.** Lowest expressed and experienced emotions for user in Transcript 2.

There is generally a gap between what people express and what they experience. The aforementioned case studies were illustrations wherein one user *mostly* expressed what was experienced and the other *always* expressed what was experienced. However, there could be cases where people mostly hide certain emotions or never exhibit them. Such quantitative studies of expressed and experienced emotions can thus be useful in constructing "emotion profiles" of users, i.e., characteristic patterns users exhibit in expressing and experiencing emotions. Understanding such details can help both the users and the counselors assisting them. For example, if someone is scared but only shows anger, it would be helpful to gently show this user that the underlying emotion is fear so that they can address it better. Such insights would also help a counselor recommend suitable solution strategies.

# **5 Conclusions**

We presented an approach to understanding the relationship between expressed and experienced emotions over the course of a conversation. Specifically, we evaluated the probability of a user experiencing an emotion based on knowledge of their expressed emotions, and we discussed how the relationship between expressed and experienced emotions can be leveraged to understand a user. Such emotion analytics can be powerfully deployed in conversation platforms that have machines or humans in the backend. We hope our findings will help provide personalized solutions to end-users of a CUI by augmenting CUIs with an algorithmically guided emotional sense, enabling more effective conversations with end-users.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

### Standard Co-training in Multiword Expression Detection

Senem Kumova Metin

Department of Software Engineering, Faculty of Engineering, Izmir University of Economics, Sakarya Caddesi, No. 156, Izmir, Turkey senem.kumova@ieu.edu.tr

Abstract. Multiword expressions (MWEs) are units in language where multiple words unite without an obvious/known reason. Since MWEs occupy a prominent amount of space in both written and spoken language material, identifying MWEs is accepted to be an important task in natural language processing.

In this paper, considering MWE detection as a binary classification task, we propose to use a semi-supervised learning algorithm, standard co-training [1]. Co-training is a semi-supervised method that employs two classifiers with two different views to iteratively label unlabeled data in order to enlarge training sets of limited size. In our experiments, linguistic and statistical features that distinguish MWEs from random word combinations are used as the two views. Two different pairs of classifiers are employed with a group of experimental settings. The tests are performed on a Turkish MWE data set of 3946 positive and 4230 negative MWE candidates. The results show that the classifier using the statistical view succeeds in MWE detection when the training set is enlarged by co-training.

Keywords: Multiword expression · Classification · Co-training

### 1 Introduction

A learning machine requires experience, in other words a training phase, to learn. The way this experience is obtained divides machine learning methods into three main categories: supervised, unsupervised, and reinforcement learning. In supervised learning, a labeled data set is given to the machine during training; having gained the ability to label a given sample, the machine can then classify the test samples. In unsupervised learning, the labels of the samples are not provided during the training phase; the machine is expected to learn the structure and variety of the unlabeled sample set and to extract the clusters itself. In reinforcement learning, the machine interacts with a dynamic environment and aims to reach a predefined goal; its training is driven by rewards and penalties.

Supervised methods require a sufficient amount of labeled training samples to succeed in classifying unlabeled data. However, in many problems it is not possible to provide this amount of labeled samples, or preparing such a sample set is too costly. In such cases, the machine may be forced to learn from unlabeled data as well. This is why semi-supervised learning is defined as a halfway point between supervised and unsupervised learning [2].

In semi-supervised learning methods, training is commonly performed iteratively. In the first iteration, a limited number of labeled samples is given to the machine to learn from. The machine then labels the unlabeled samples, the most reliably labeled ones are added to the labeled set, and the machine is re-trained on this enlarged labeled set in the next iteration. After a number of iterations, the learning phase is considered finished and the machine is ready to label the unlabeled data set. In another group of semi-supervised methods, constraints are defined to supervise the training phase [2].

The earliest implementation of the semi-supervised learning approach is probably self-training [2]. In self-training, a single machine, trained on a labeled sample set, iteratively enlarges its own labeled set by labeling the unlabeled set. Co-training, an alternative to self-training, was proposed by Blum and Mitchell [1]. Co-training aims to increase classification performance by employing two classifiers that consider different views of the data to label the unlabeled samples during the training phase. There exist several implementations of the method applied to different problems such as word sense disambiguation [3], semantic role labeling [4], statistical parsing [5], identification of noun phrases [6], opinion detection [7], e-mail classification [8], and sentiment classification [9].

In this study, we examine the effect of co-training in an important natural language processing task: multiword expression detection. The notion of multiword expression may be explained in a variety of ways. Simply put, MWEs are word combinations in which words unite to build a new syntactic or semantic unit in language. Since words may change their meaning or role in text when they form an MWE, detecting MWEs plays a critical role in language understanding and language generation studies. For example, the expression "lady killer" is an MWE meaning "an attractive man", but if the meanings of the composing words are considered individually, the expression refers to something completely different. In MWE detection, it is assumed that the links between the composing words of an MWE are stronger than those between random combinations of words. The strength of these links is commonly measured by statistical and/or linguistic features extracted from the given text or a text collection (e.g. [10–13]).

In a wide group of studies that aim at identifying MWEs, the task is treated as a classification problem and several machine-learning methods are employed. For example, in [13] statistical features are combined using supervised methods such as linear logistic regression, linear discriminant analysis, and neural networks. In [12], multiple linguistically motivated features are fed to neural networks to identify MWEs in a set of Hebrew bigrams (uninterrupted two-word combinations). In [14], experiments are performed on a Turkish data set with linguistic features using 10 different classifiers (e.g. J48, sequential minimal optimization, k-nearest neighbor).

In this study, we aim to examine the change in MWE recognition performance when co-training is employed. The paper is organized as follows. We first present semi-supervised learning and co-training in Sect. 2. The experimental setup is given in Sect. 3, results are presented in Sect. 4, and the paper is concluded in Sect. 5.

### 2 Semi-supervised Learning: Co-training

Semi-supervised methods were proposed to overcome the disadvantages of supervised learning when a sufficient amount of labeled samples is lacking. The methods are reported to succeed in cases where assumptions such as the smoothness, clustering, manifold, or transduction assumptions hold.

Semi-supervised methods are mainly categorized into four groups: generative models, low-density separation, graph-based models, and change of representation [2]. In generative models, the main aim is to model the class-conditional density; co-training [1] and expectation maximization [15] are well-known examples. Low-density separation methods, such as the transductive support vector machine proposed by [16], try to locate decision boundaries in low-density regions, away from the unlabeled samples. The methods presented in [17–19] are examples of graph-based methods, where each node represents a sample and classification is performed by measuring the distance between nodes. The change-of-representation approach requires a two-stage training: in the first stage, labeled samples are considered without their labels, so the representation of the samples is effectively changed; in the second stage, unlabeled samples are excluded from the data set and supervised learning is performed with the new measure/kernel.

In this study, the semi-supervised method co-training is implemented to identify MWEs. The co-training algorithm given in Fig. 1, referred to here as standard co-training, was proposed by [1].


Fig. 1. Standard co-training algorithm [1]

In standard co-training, the main aim is to build a classifier trained on L labeled and U unlabeled samples, where L is known to be small. To overcome the disadvantage of having a limited number of labeled samples, [1] proposed to split the feature vector into two groups of features, each group representing a different view of the data set. Each group of features (split/view) is used to train one of the classifiers. The assumptions that guarantee the success of co-training are explained in [1].


In several studies such as [6, 20], researchers investigated to what degree these assumptions and the data set size affect the performance of the co-training algorithm. For example, experimenting on the same problem as in [1], the authors of [20] reported that even when the independence assumption is not satisfied, co-training still performs better than the alternatively proposed expectation maximization algorithm, since in each co-training iteration all samples are compared to determine the most confidently labeled ones.

The standard co-training algorithm was applied to classify web pages in [1]. The first group of features is built from the words in the web pages and the second group from the words in the web links. Both classifiers use the Naive Bayes algorithm, and the tests were performed with p = 1 and n = 3. In [1], it is reported that the proposed co-training algorithm reaches higher classification performance than supervised machine learning.
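As an illustration, the loop structure of standard co-training can be sketched as follows. This is a toy sketch, not the implementation evaluated in this paper: a simple nearest-centroid classifier stands in for the actual classifier pairs, the two views are the two halves of a synthetic feature vector, and labels 1/0 denote positive/negative samples.

```python
# Toy sketch of standard co-training (not the paper's implementation):
# a nearest-centroid classifier stands in for the real classifier pair,
# and the two views are the two halves of each feature vector.
def train_centroid(samples):
    """Fit one centroid (mean feature vector) per class label."""
    sums, counts = {}, {}
    for x, y in samples:
        if y not in sums:
            sums[y], counts[y] = [0.0] * len(x), 0
        counts[y] += 1
        sums[y] = [s + v for s, v in zip(sums[y], x)]
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def predict_conf(model, x):
    """Return (label, confidence); confidence = -distance to nearest centroid."""
    dists = {y: sum((a - b) ** 2 for a, b in zip(c, x)) for y, c in model.items()}
    label = min(dists, key=dists.get)
    return label, -dists[label]

def co_train(labeled, unlabeled, view1, view2, p=1, n=1, iters=10):
    """Grow the labeled set L from U using two view-specific classifiers."""
    unlabeled = [list(x) for x in unlabeled]
    for _ in range(iters):
        if not unlabeled:
            break
        for view in (view1, view2):
            model = train_centroid([(view(x), y) for x, y in labeled])
            scored = [(predict_conf(model, view(x)), x) for x in unlabeled]
            # Move the p most confident positives (label 1) and the n most
            # confident negatives (label 0) from U to L.
            for target, k in ((1, p), (0, n)):
                best = sorted((s for s in scored if s[0][0] == target),
                              key=lambda s: -s[0][1])[:k]
                for (lab, _), x in best:
                    if x in unlabeled:
                        unlabeled.remove(x)
                        labeled.append((x, lab))
    return train_centroid([(view1(x), y) for x, y in labeled])

# Synthetic binary data: both features low -> class 0, both high -> class 1.
labeled = [([0.0, 0.0], 0), ([1.0, 1.0], 1)]
unlabeled = [[0.1, 0.1], [0.9, 0.9], [0.2, 0.0], [1.0, 0.8]]
model = co_train(labeled, unlabeled, lambda x: x[:1], lambda x: x[1:])
```

The final model is trained on the first view of the enlarged labeled set; in the real algorithm, both classifiers (or their combination) can be used at test time.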

### 3 Experimental Setup

The experiments to examine performance of co-training in MWE detection require the following four tasks to be performed:


We propose to use linguistic and statistical features as two different views of the MWE data set. In this study, the linguistic view includes 8 linguistic features, listed below:

1. Partial variety in surface forms (PVSF\_m and PVSF\_n): In MWE detection studies, it is commonly accepted that MWEs are not observed in a wide variety of surface forms in language. As a result, the histogram of occurrence frequencies of the different surface forms belonging to the same MWE is expected to be non-uniform [12]. We measured the variety in surface forms in two ways, called PVSF\_m and PVSF\_n, based on the surface form histogram, similar to [12]. Briefly, the Manhattan distance between the actual surface form histogram of the MWE candidate and the expected uniform histogram is used as PVSF\_m. The ratio of PVSF\_m to the total occurrence frequency of the candidate (in any form) is PVSF\_n.
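A minimal sketch of these two features, assuming the Manhattan-distance definition described above (the exact formulation in [12] may differ):

```python
# Sketch of PVSF_m / PVSF_n, assuming the Manhattan-distance definition
# described above (the exact formulation in [12] may differ).
def pvsf(surface_form_counts):
    """surface_form_counts: occurrences of each surface form of a candidate."""
    total = sum(surface_form_counts)
    uniform = total / len(surface_form_counts)  # expected uniform histogram
    pvsf_m = sum(abs(c - uniform) for c in surface_form_counts)
    pvsf_n = pvsf_m / total  # normalised by total occurrence frequency
    return pvsf_m, pvsf_n

# A candidate seen almost always in one surface form (MWE-like, non-uniform
# histogram) scores higher than one spread evenly over its forms:
assert pvsf([9, 1]) > pvsf([5, 5])
```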


The statistical view includes 18 features (Table 1). These features are commonly used in many studies (e.g. [10, 13, 21]). In Table 1, w<sub>1</sub> and w<sub>2</sub> denote the first and the second word of a given MWE candidate, respectively.

In Table 1, P(w<sub>1</sub>w<sub>2</sub>) is the probability that the words w<sub>1</sub> and w<sub>2</sub> co-occur sequentially, and P(w<sub>1</sub>) and P(w<sub>2</sub>) are the occurrence probabilities of the first and the second word. P(w<sub>i</sub>|w<sub>j</sub>) gives the conditional occurrence probability of word w<sub>i</sub> given that word w<sub>j</sub> is observed. f(w<sub>1</sub>w<sub>2</sub>), f(w<sub>1</sub>), and f(w<sub>2</sub>) are the occurrence frequencies of the bigram w<sub>1</sub>w<sub>2</sub> and of the words w<sub>1</sub> and w<sub>2</sub>, respectively. The numbers of distinct words following and preceding the bigram are denoted v<sub>f</sub>(w<sub>1</sub>w<sub>2</sub>) and v<sub>b</sub>(w<sub>1</sub>w<sub>2</sub>), respectively.
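These quantities can be illustrated on a toy token sequence (synthetic data, not the Turkish corpus used in the experiments):

```python
from collections import Counter

# Toy illustration of the statistical quantities above on a synthetic
# token sequence (not the Turkish corpus used in the experiments).
tokens = ["lady", "killer", "on", "the", "run", "lady", "killer", "is", "here"]
bigrams = list(zip(tokens, tokens[1:]))
f_uni, f_bi = Counter(tokens), Counter(bigrams)

def p(w):                         # P(w): unigram occurrence probability
    return f_uni[w] / len(tokens)

def p_bigram(w1, w2):             # P(w1 w2): sequential co-occurrence
    return f_bi[(w1, w2)] / len(bigrams)

def p_cond(wi, wj):               # P(wi | wj): wi observed right after wj
    return f_bi[(wj, wi)] / f_uni[wj]

def v_f(w1, w2):                  # distinct words following the bigram
    return len({tokens[i + 2] for i in range(len(tokens) - 2)
                if tokens[i:i + 2] == [w1, w2]})

def v_b(w1, w2):                  # distinct words preceding the bigram
    return len({tokens[i - 1] for i in range(1, len(tokens) - 1)
                if tokens[i:i + 2] == [w1, w2]})

# "lady killer" occurs twice, each time followed by a different word:
assert v_f("lady", "killer") == 2 and p_cond("killer", "lady") == 1.0
```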

In this study, the classifiers SMO (Sequential Minimal Optimization) [22, 23], J48 [24], and logistic regression (Logistic) [25] are employed in the classifier pairs presented in Table 2. A Turkish MWE data set that includes 8176 samples of MWE candidates (3946 positive (labeled as MWE) and 4230 negative (labeled as non-MWE)) is used in the experiments.

Table 1. Statistical features

Table 3 presents the sizes of the labeled (L), unlabeled (U), and test (T) data sets. For example, in experimental setting no. 1, the labeled set has 50 samples, the unlabeled set has 250 samples, and the test set size is 100.

The classification is evaluated using the F1 measure, given as

$$\text{F1} = \frac{2TP}{2TP + FN + FP} \tag{1}$$


Table 2. Classifier pair


Table 3. Data sets


where TP is the number of true positives (candidates that are both expected and predicted to belong to the same class, MWE or non-MWE), FN is the number of false negatives, and FP is the number of false positives.
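For instance, Eq. (1) can be computed directly from the three counts:

```python
def f1(tp, fn, fp):
    """Eq. (1): F1 = 2*TP / (2*TP + FN + FP)."""
    return 2 * tp / (2 * tp + fn + fp)

# e.g. 40 true positives, 10 false negatives, 10 false positives:
# f1(40, 10, 10) == 0.8
```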

### 4 Results

The performance of standard co-training, given in Fig. 1, is examined by repeating each experiment 5 times (5 runs) per test setting. The numbers of positive (p) and negative (n) samples inserted into the labeled data set in each iteration are both set to one. In each run, the data set is shuffled to build the labeled (L), unlabeled (U), and test sets randomly. Table 4 gives the average evaluation results of these tests. In Table 4,



Table 4. Testing results of standard co-training.

The shaded regions in Table 4 show the settings in which F<sub>i</sub> < F<sub>c</sub>, meaning that the F1 value increases when the training set is enlarged by co-training. It is observed that standard co-training succeeds for all settings of the statistical classifier. The cells with bold F1 values represent the settings in which F<sub>c</sub> > F<sub>s</sub>, meaning that the training set enlarged by co-training supervises the classifier more successfully than the same data set with human-annotated sample labels.

Table 5 gives minimum, average and maximum F1 values of both classifiers for three different cases:



Table 5. F1 results with and without co-training.

From Table 5, three important outputs are observed. These are:


### 5 Conclusion

In this study, we presented our efforts to improve MWE detection performance using the standard co-training algorithm. The results showed that, especially for the classifier that employs statistical features, performance is improved by co-training. As future work, we plan to apply different versions of co-training and to run tests with different types of classifiers.

Acknowledgement. This work was carried out under a grant of TÜBİTAK (The Scientific and Technological Research Council of Turkey), Project No. 115E469, Identification of Multi-word Expressions in Turkish Texts.

We thank Mehmet Taze, Hande Aka Uymaz, Erdem Okur, and Levent Tolga Eren for their efforts in labeling the MWE data set.

### References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

### **Comparative Study on Normalisation in Emotion Recognition from Speech**

Ronald Böck, Olga Egorow, Ingo Siegert, and Andreas Wendemuth

Cognitive Systems Group, Otto von Guericke University Magdeburg, Universitätsplatz 2, 39106 Magdeburg, Germany ronald.boeck@ovgu.de http://www.cogsy.de

**Abstract.** The recognition performance of a classifier is affected by various aspects, and input data pre-processing has a huge influence. In the current paper we analysed the relation between different normalisation methods for emotionally coloured speech samples, deriving general trends to be considered during data pre-processing. To the best of our knowledge, various normalisation approaches are used in the spoken affect recognition community, but so far no multi-corpus comparison has been conducted. Therefore, well-known methods from the literature were compared in a larger study based on nine benchmark corpora, applying a leave-one-speaker-out validation strategy within each data set. As normalisation approaches, we investigated standardisation, range normalisation, and centering. These were tested in two possible options: (1) the normalisation parameters were estimated on the whole data set, and (2) the parameters were obtained from emotionally neutral samples only. For classification, Support Vector Machines with linear and polynomial kernels as well as Random Forest were used as representatives of classifiers that handle input material in different ways. Besides further recommendations, we show that standardisation leads to a significant improvement of recognition performance. We also discuss when and how to apply normalisation methods.

### **1 Introduction**

The detection of affective user states is an emerging topic in the context of human-computer interaction (HCI) (cf. [19,24]), as it is known that, besides the pure content, additional information on the user's feelings, moods, and intentions is transmitted during communication. For instance, [1] discussed that such information should be used in HCI for a more general view of the human interlocutor.

The detection of emotions from speech can be seen as a challenging issue, since both the emotions themselves and the way humans utter them introduce variations that increase the difficulty of a distinct assessment (cf. [2,24]). Furthermore, many up-to-date classification methods analyse data based on the distances between the given sample points (cf. [24]). Consequently, the given samples have to be scaled in a comparable way, which leads to the question of data normalisation before classification. Yet there are many approaches to data normalisation (cf. e.g. [26] pp. 45–49), which are used in various studies.

The paper's aim is to investigate and compare the different normalisation methods and to deduce in which situation each performs best. Since we were mainly interested in the *general trend* of the recognition results, we do not argue from raw classification results but *derive more general statements*. We are aware that a highly optimised classifier would outperform the systems presented in this paper; in such cases, however, it is hard to identify the general statements we are looking for. The presented analyses are therefore based on six normalisation methods dominantly used in the literature, applied to nine benchmark corpora well known in the speech-based emotion recognition community.

The investigation is guided by the following research questions: **Q1:** Which normalising methods are usually applied in the community? **Q2:** Which normalisation approach provides the best recognition results? **Q3:** At which point can and shall normalisation be applied to the data? **Q4:** Can we derive recommendations stating which method(s) shall be used to achieve a reasonable improvement in the emotion recognition from speech?

*Related Work.* Normalisation is a pre-processing step applied to given material to handle differences caused by various circumstances. To the best of our knowledge, no comparison of different normalisation methods across several benchmark corpora has been conducted for emotion recognition from speech. Nevertheless, various approaches used in the community form the foundations of this paper. We also found that a heterogeneous terminology is used in the literature (cf. e.g. [15,31]); in the following, we therefore use a uniform naming of normalisation methods.

In general, two papers present an overview of normalisation: [26] presents normalisation techniques in the context of speaker verification, and for emotion recognition from speech we found a rather brief overview in [31], highlighting that the same names often refer to different normalisation approaches.

Among the different normalisation techniques, the most prominent is standardisation (cf. [31]), although it is often just called normalisation. In most cases, papers refer to it as z-normalisation (cf. [7,9,16,21,22,25]) or as mean-variance normalisation (cf. [29]).

Range normalisation and centering are, to the best of our knowledge, used only in the work of [15,31]. In [31], the authors applied these methods to only six data sets (a subset of the corpora presented in Table 1), considered only two affective states, and did not vary the classifier.

Another approach, highlighted in [15], is normalisation based on neutral data. This idea was introduced in [3] and further elaborated in [4]. In [15], the authors apply it to all three presented normalisation methods. As this is a promising approach that preserves the differences between affective states (cf. [3]), we included it in our experiments as well.
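For illustration, the three normalisation methods and the two parameter-estimation options can be sketched as follows. The feature values are synthetic one-dimensional placeholders; `ref` selects the samples used to estimate the parameters (the whole set, or neutral samples only).

```python
from statistics import mean, stdev

# Sketch of the three normalisation methods with both parameter options:
# ref=None estimates parameters on the whole set (option 1); passing the
# emotionally neutral samples as ref gives option 2.
def standardise(xs, ref=None):
    ref = xs if ref is None else ref
    mu, sigma = mean(ref), stdev(ref)
    return [(x - mu) / sigma for x in xs]

def range_normalise(xs, ref=None):
    ref = xs if ref is None else ref
    lo, hi = min(ref), max(ref)
    return [(x - lo) / (hi - lo) for x in xs]

def centre(xs, ref=None):
    ref = xs if ref is None else ref
    mu = mean(ref)
    return [x - mu for x in xs]

features = [1.0, 2.0, 3.0, 4.0]                     # synthetic feature values
neutral = [1.5, 2.5]                                # neutral samples only
whole_set = standardise(features)                   # option (1)
neutral_based = standardise(features, ref=neutral)  # option (2)
```

In a real pipeline the same per-feature parameters would be applied to every dimension of the acoustic feature vectors, typically per speaker or per corpus.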

Several papers, such as [11,24,30], do not use any normalisation at all. This practice is related to the statement that "[f]unctionals provide a sort of normalisation over time" [24], assuming that normalisation is implicitly provided by the selected features, which are mainly based on functionals.

In general, the presented works vary in normalisation approaches, classification techniques, and utilised corpora, so a direct comparison of results is quite difficult for readers. The closest related papers for comparison are [21,31], as they refer to subsets of the benchmark corpora we analysed. Since we were interested in the general characteristics of the normalisation methods, we did not opt for fully optimised recognition results.

### **2 Data Sets**

This study is focussed on the influence of normalisation approaches on classification performance. Therefore, we decided to apply the various methods described in the literature to data sets widely used in the community. To cover various characteristics in the experiments, the corpora provide material in various languages, speaker ages and sexes, as well as different emotional classes. Further, the material was recorded under different conditions reflecting acted and spontaneous (acoustic) expressions. The individual characteristics of each data set are presented in Table 1 and are briefly introduced<sup>1</sup> in the following.

**Table 1.** Overview of the selected emotional speech corpora characteristics, including the number of classes (# C.) and whether the corpus provides neutral speech material (Neu.).


The *Airplane Behaviour Corpus (ABC)* (cf. [23]) was developed for applications related to public transport surveillance. Certain moods were induced using a predefined script guiding subjects through a storyline. Eight speakers (balanced in sex) aged 25–48 years (mean 32 years) took part in the recording. The 431 clips have an average duration of 8.4 s and cover six emotions.

<sup>1</sup> The explaining text for each corpus is inspired by [27].

The *Audiovisual Interest Corpus* (AVIC) (cf. [20]) contains samples of interest. The scenario setup is as follows: A product presenter leads each of the 21 subjects (ten female) through an English commercial presentation. The level of interest is annotated for every sub-speaker turn.

The *Danish Emotional Speech (DES)* (cf. [8]) data set contains samples of five acted emotions. The data used in the experiments are Danish sentences, words, and chunks expressed by four professional actors (two female), which were afterwards judged according to emotion categories.

The *Berlin Emotional Speech Database (emoDB)* (cf. [2]) is a studio-recorded corpus. Ten (five female) professional actors utter ten German sentences with emotionally neutral content. The resulting 492 phrases were selected using a perception test and fall into seven predefined categories of acted emotional expressions (cf. [2]).

The *eNTERFACE* (cf. [18]) corpus comprises recordings from 42 subjects (eight female) from 14 nations. It consists of office environment recordings of pre-defined spoken content in English. Overall, the data set consists of 1277 emotional instances in six induced emotions. The quality of emotional content spans a much broader variety than in emoDB.

The *Belfast Sensitive Artificial Listener (SAL)* (cf. [6]) corpus contains 25 audio-visual recordings from four speakers (two female). The depicted HCI sessions were recorded using an interface designed to let users work through a continuous space of emotional states. In our experiments we used a clustering provided by [21], mapping the original arousal-valence space onto four quadrants.

The *SmartKom* (cf. [28]) multi-modal corpus provides spontaneous speech including seven natural emotions in German and English given a Wizard-of-Oz setting. For our experiments, we used only the German part.

The *Speech Under Simulated and Actual Stress (SUSAS)* (cf. [14]) data set contains spontaneous and acted emotional samples, partly masked by field noise. We chose a subset of the corpus providing 3593 actual-stress speech segments recorded in speaker-motion fear and stress tasks. Seven subjects (three female) in roller coaster and free fall stress situations uttered emotionally coloured speech in four categories.

The *Vera-Am-Mittag (VAM)* corpus consists of audio-visual recordings taken from an unscripted German TV talk show (cf. [12]). The employed subset includes 946 spontaneous, emotionally coloured utterances from 47 participants. We transformed the continuous emotion labels into four quadrants according to [21].

# **3 Normalising Methods**

We reviewed the literature on normalisation methods utilised in speech-based emotion recognition and found four main approaches, but no direct comparison amongst them. Furthermore, the utilised methods are named differently by various authors although they employ the same approaches. Therefore, we structured the methods and harmonised the naming.

Generally, we define x as the input value representing, for instance, a speech feature, μ as the corresponding mean value, and σ as the corresponding standard deviation.

*Standardisation* is an approach to transform the input material to obtain standard normally distributed data (μ = 0 and σ = 1). The method is computed as given in Eq. 1.

$$x\_s = \frac{x - \mu}{\sigma} \tag{1}$$
 

*Range Normalisation* is also called normalisation and is thus often confused with common standardisation. Therefore, we chose the term *range normalisation*, which implies the possibility of varying the transformation interval. In Eq. 2 the interval is specified by [a, b], and x_min and x_max are the minimal and maximal values per feature. In contrast to standardisation (cf. Eq. 1), the approach uses neither mean nor variance.

$$x\_n = a + \frac{(x - x\_{\min})(b - a)}{x\_{\max} - x\_{\min}} \tag{2}$$

In our experiments we chose the interval [−1, 1] for range normalisation.

The *Centering* approach frees the given input data from the corresponding mean (cf. Eq. 3). Therefore, the transformation results in a shift of input data.

$$x\_c = x - \mu \tag{3}$$

*Neutral Normalisation* is an approach where the normalisation parameters are computed on neutral data only. It is described in [4] and is a logical extension of the idea of using neutral speech models for emotion classification (cf. [3]); it is employed for normalisation purposes in [15]. The method works as follows: the parameters μ and σ, or x_min and x_max, respectively, are obtained per feature from the samples annotated as neutral and are then applied to samples with other emotional impressions. In our experiments this was done separately for each aforementioned normalisation method, namely standardisation, range normalisation, and centering.
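As an illustration, the four normalising methods above can be sketched in NumPy as follows (a minimal sketch; the function names are ours, not from the paper):

```python
import numpy as np

def standardise(X, mu=None, sigma=None):
    """Standardisation (Eq. 1): zero mean, unit variance per feature."""
    mu = X.mean(axis=0) if mu is None else mu
    sigma = X.std(axis=0) if sigma is None else sigma
    return (X - mu) / sigma

def range_normalise(X, a=-1.0, b=1.0, xmin=None, xmax=None):
    """Range normalisation (Eq. 2): map each feature to the interval [a, b]."""
    xmin = X.min(axis=0) if xmin is None else xmin
    xmax = X.max(axis=0) if xmax is None else xmax
    return a + (X - xmin) * (b - a) / (xmax - xmin)

def centre(X, mu=None):
    """Centering (Eq. 3): shift each feature to zero mean."""
    mu = X.mean(axis=0) if mu is None else mu
    return X - mu

def neutral_standardise(X, X_neutral):
    """Neutral normalisation: parameters estimated on neutral samples only,
    then applied to all (emotional) samples."""
    return standardise(X, mu=X_neutral.mean(axis=0),
                       sigma=X_neutral.std(axis=0))
```

The optional `mu`/`sigma`/`xmin`/`xmax` arguments allow the parameters to be estimated on one set (e.g. neutral or training data) and applied to another, as the neutral variant requires.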

*Application* of normalisation methods is as follows: the described normalising methods were applied to the training material as well as to the testing samples. For the test set, two practices are possible, and both were examined in our experiments. The first option assumes that both sets are known; each set can then be normalised separately with its own optimal parameters (e.g. μ and σ). In the second option, the necessary parameters are estimated on the training set only and applied to the testing set; here it is assumed that the test samples are unknown in advance, so no parameter estimation can be performed on them.
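The two application options for the test set can be sketched as follows (shown for standardisation only; an illustrative sketch, not the authors' code):

```python
import numpy as np

def normalise_separately(X_train, X_test):
    """Option 1: both sets are known; each is normalised with its own
    (accordingly optimal) parameters."""
    z = lambda X: (X - X.mean(axis=0)) / X.std(axis=0)
    return z(X_train), z(X_test)

def normalise_from_train(X_train, X_test):
    """Option 2: parameters are estimated on the training set only and
    applied unchanged to the (assumed unknown) test set."""
    mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
    return (X_train - mu) / sigma, (X_test - mu) / sigma
```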

### **4 Experimental Setup**

To evaluate the influence of normalisation, we conducted a series of classification experiments. Since one of our objectives was to obtain *reproducible* results comparable to other studies, we decided to employ established feature sets and classifiers.

The *emobase* feature set is well known in the speech emotion recognition community. It comprises 988 functionals (e.g. mean, minimum, maximum) computed over acoustic low-level descriptors (e.g. pitch, mel-frequency cepstral coefficients, line spectral pairs, fundamental frequency) [10]. The features are extracted at the utterance level, resulting in one vector per utterance.

We employed two different kinds of *classifiers*: the distance-based Support Vector Machine (SVM) and the non-distance-based Random Forest (RF). We expected that normalisation would provide significant improvement with SVM, and little or no improvement with RF. For SVM, we used the LibSVM implementation by [5] as provided in WEKA [13]; for RF, we also relied on WEKA.

Since the data sets used in the experiments are very diverse, it would be difficult, if not impossible, to fine-tune the classifiers to fit all the data. Therefore, we used standard parameters for both SVM and RF without further fine-tuning. In the case of SVM, we chose a linear kernel (referred to as lin-SVM) and a polynomial kernel of degree 3 (referred to as pol-SVM), both with cost parameter C = 1.0. In the case of RF, we used 1000 trees and 32 features per node, since the square root of the number of input features (in our case 988) is a common default in RF implementations.
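For reference, an analogous configuration in scikit-learn might look as follows. The paper used LibSVM and RF via WEKA; this sketch only mirrors the stated parameters (linear and degree-3 polynomial kernels with C = 1.0, 1000 trees, 32 features per split) and is not the original setup:

```python
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Linear and degree-3 polynomial SVMs, both with cost parameter C = 1.0.
lin_svm = SVC(kernel="linear", C=1.0)
pol_svm = SVC(kernel="poly", degree=3, C=1.0)

# RF with 1000 trees; 32 ~ sqrt(988) features considered per split.
rf = RandomForestClassifier(n_estimators=1000, max_features=32)
```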

We evaluated the classifiers in a Leave-One-Speaker-Out (LOSO) manner, using the Unweighted Average Recall (UAR) of all emotions per speaker as evaluation metric.
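The LOSO protocol and the UAR metric can be sketched as follows; `fit_predict` is a placeholder callable (our naming) standing in for any trained classifier:

```python
import numpy as np

def uar(y_true, y_pred):
    """Unweighted Average Recall: the mean of the per-class recalls,
    so every class counts equally regardless of its frequency."""
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(recalls))

def loso_uar(X, y, speakers, fit_predict):
    """Leave-One-Speaker-Out: train on all speakers but one, evaluate
    on the held-out speaker, and report one UAR per speaker."""
    scores = {}
    for s in np.unique(speakers):
        test = speakers == s
        y_pred = fit_predict(X[~test], y[~test], X[test])
        scores[s] = uar(y[test], y_pred)
    return scores
```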

# **5 Results**

Figure 1 shows the results at a glance for lin-SVM on two of the nine investigated corpora (ABC and eNTERFACE). For the ABC corpus, some normalising methods such as standardisation performed better than others for nearly all speakers. For the eNTERFACE corpus, the performance of the same normalising method varies remarkably depending on the speaker.

**Table 2.** Classification results (UAR, averaged over all nine corpora, in %) for all normalising methods (NN - non-normalised, S(-neu) - standardisation (with neutral), RN(-neu) - range normalisation (with neutral), C(-neu) - centering (with neutral)). The best classification result is highlighted for each classifier.


**Fig. 1.** UAR per speaker in (a) ABC and (b) eNTERFACE for lin-SVM.

In Table 2, the results are shown in a more detailed way, comparing the mean UAR, averaged over all nine corpora for all normalising methods and classifiers. For two of the three classifiers, standardisation outperformed other methods – and in the case of lin-SVM, neutral standardisation worked even better. Also, we see that standardisation and neutral standardisation were the only two normalising methods that always led to an improvement of the classification results.

An interesting observation emerges from the mean and standard deviation of all normalising methods presented in Table 2: for both SVM classifiers, normalising the data in any way changed the results (on average, +4.1% for lin-SVM and −4.5% for pol-SVM, absolute) more than in the case of RF (only 0.2%). There were also noticeable differences between the normalising methods, resulting in a higher standard deviation for both SVM classifiers compared with RF. Both observations support our hypothesis that, in the case of SVM, changing the distances between data points by applying a normalising method influences the classification results, whereas in the case of RF, normalisation does not change the classification results significantly.

Another interesting point concerns the results using pol-SVM: applying range normalisation significantly impairs the classification, leading to a UAR drop of 14.5% absolute. Our hypothesis is that a non-linear effect is induced by the combination of the polynomial kernel and high-dimensional data. To investigate this phenomenon, we conducted a series of additional experiments using polynomial kernels of increasing degree. The results are shown in Table 3. An increasing kernel degree led to a drop in performance; for higher degrees the performance



decreases to chance level. This effect does not occur on non-normalised data, so we conclude that it is related to, or caused by, range normalisation.

For a closer look at the multi-corpus evaluation, the classification results in terms of UAR obtained with lin-SVM are presented in Table 4. Since the data were not normally distributed, we used the Mann-Whitney U test (cf. [17]) to assess the significance of all classification outcomes. For five of the nine corpora, the improvements of normalised over non-normalised data were statistically significant (p < 0.1). But even where the improvements were not significant, normalising the data led to at least some gains: for all corpora except SAL, standardisation or standardisation on neutral data achieved the best results (cf. Table 4). In the case of SAL, range normalisation achieved the best result, but only by 0.2% over standardisation. Conversely, inappropriate normalising methods could also impair the results: in the case of AVIC, eNTERFACE, and SUSAS, all normalising methods except standardisation led to minor, though not statistically significant, decreases.
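Such a significance test can be sketched with SciPy; the per-speaker UAR values below are purely illustrative placeholders, not results from the paper:

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical per-speaker UARs for normalised vs. non-normalised runs
# (illustrative values only, not taken from Table 4).
uar_norm = np.array([0.62, 0.58, 0.71, 0.66, 0.60, 0.69, 0.64, 0.67])
uar_base = np.array([0.51, 0.49, 0.55, 0.53, 0.48, 0.54, 0.50, 0.52])

# One-sided test: does normalisation improve the UAR distribution?
stat, p = mannwhitneyu(uar_norm, uar_base, alternative="greater")
significant = p < 0.1
```

The rank-based Mann-Whitney U test makes no normality assumption, which matches the stated reason for choosing it over a t-test.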

**Table 4.** Results achieved (UAR in %) using lin-SVM on normalised data and the non-normalised baseline. Best results are highlighted in gray, results below the baseline are given in *italic*. Significance levels: \*\*\*p *<* 0.01, \*\*p *<* 0.05, \*p *<* 0.1


Concerning normalisation of the training and test sets, either with independently calculated parameters or with parameters calculated on both sets, we found no significant difference in terms of UAR. There were some fluctuations in the results depending on the corpus, but the differences occurred in both directions, showed no trend towards either option, and remained within the standard deviation. For example, in the case of AVIC, the maximum difference in UAR between independently and jointly calculated parameters is 1.5% in favour of the former, with standard deviations of 6.6% and 8.3% for independently and non-independently calculated normalisation parameters, respectively.

### **6 Discussion**

In this section, the experimental results (cf. Sect. 5) are discussed with respect to questions Q1 to Q4.

For question Q1, we analysed various works reflecting the state of the art in the community (cf. Sect. 1). From these, we find that mainly two different approaches are used, namely standardisation and (range) normalisation; centering is applied less frequently. Further, as presented in [3], the normalisation parameters can also be estimated on emotionally neutral samples, which we tested in our experiments as well. We also find a slight trend towards standardisation in the literature.

Given this overview, we selected the three most prominent methods for our experiments, namely standardisation, range normalisation, and centering (cf. Sect. 3). Where possible, they were also applied in the context of neutral normalisation. Our results confirm the aforementioned trend towards standardisation: for eight benchmark corpora (cf. Table 1), standardisation improved the recognition performance. The same holds for neutral normalisation, where standardisation shows the best performance as well (cf. question Q2).

In our experiments we applied the LOSO validation strategy, which allows us to analyse the recognition performance in a speaker-independent way. As shown in Fig. 1 for ABC and eNTERFACE, the recognition results depend on the speaker to be tested; this effect is seen on the other corpora as well. Nevertheless, we find a relation between normalisation methods and performance. For corpora containing mainly acted speech samples, a clustering of particular normalisation methods can be seen (cf. the gap between lines in Fig. 1(a)). In contrast, for data sets providing more spontaneous emotions, no such clustering is apparent, and the different methods lie closer to each other in absolute numbers (cf. Fig. 1(b)). In our view, this is related to the lower expressivity of emotions uttered in spontaneous conversations, where no particular normalisation approach is able to improve the recognition performance.

As presented in Table 4, standardisation provides the best results across the nine benchmark corpora; only in the case of SAL does range normalisation outperform standardisation, and then by just 0.2% absolute. Based on the Mann-Whitney U test, we show that the improvement in recognition performance is significant for five corpora (at least p < 0.1). For this, we tested significance against the non-normalised classification as well as against the second-best result when the difference was small (cf. e.g. SmartKom in Table 4). This statistical significance emphasises the importance of suitable normalisation in the classification process.

Regarding how normalisation should be applied (cf. Q3), we tested two possible options: in the first, the test set is normalised independently of the training set; in the second, the test set is normalised using parameters obtained on the training set. The final results show that the differences in recognition performance are marginal, with no statistical significance for either method. Therefore, both options are useful for testing purposes, and there is no need to refrain from using separately normalised test samples.

From our experiments, we can derive some recommendations for the application of normalisation approaches (cf. question Q4). First, in a multi-corpus evaluation based on a LOSO strategy, standardisation is reasonable, since in most cases (six of nine) it leads to a (significant) improvement in classification performance. This also indicates that normalisation improves classification results even for feature sets consisting mainly of functionals (cf. *emobase* in Sect. 4); from our perspective, this qualifies the statement of [24] that functionals already provide a kind of normalisation. Secondly, there is no need to favour either handling approach for test sets, as the differences in performance are not statistically significant. Finally, the classifier also influences the effect of normalisation. From Tables 2 and 3 we see that lin-SVM achieved better results than the other two classifiers across corpora. For RF, normalisation was expected to have almost no influence, since the classification is not distance-based, which is reflected in the lower standard deviations across corpora (cf. Table 2). In contrast, pol-SVM collapses at higher degrees (cf. Table 3), especially when range normalisation is used. We assume that this is related to a non-linear interaction between the polynomial degree and the normalisation method, which will be investigated in future research.

# **7 Conclusion**

In this paper, we have shown that normalising data in speech-based emotion recognition tasks can lead to significant improvements. The extent of these improvements depends on three factors, the *general trends* we already discussed in Sect. 1. First, we have shown that standardisation works best in almost all cases: applying it improved the recognition results for all nine corpora, and for six corpora it proved to be the best normalising method. Secondly, the results depend on the classifier used: with lin-SVM, significant improvements are possible when applying standardisation as well as range normalisation, but with pol-SVM, range normalisation does not work well. The final factor is the data itself: for some corpora such as emoDB, improvements of up to 30% absolute are possible, while for others like SmartKom, only slight improvements of less than 3% absolute are achieved. From these findings we conclude that standardisation in most cases leads to substantially improved classification results.

**Acknowledgments.** We acknowledge continued support by the Transregional Collaborative Research Centre SFB/TRR 62 "Companion-Technology for Cognitive Technical Systems" (www.sfb-trr-62.de) funded by the German Research Foundation (DFG). Further, we thank the project "Mod3D" (grant number: 03ZZ0414) funded by 3Dsensation (www.3d-sensation.de) within the Zwanzig20 funding program by the German Federal Ministry of Education and Research (BMBF).

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Detecting Vigilance in People Performing Continual Monitoring Task**

Shabnam Samima1(B), Monalisa Sarma<sup>1</sup>, and Debasis Samanta<sup>2</sup>

<sup>1</sup> Subir Chowdhury School of Quality and Reliability, Indian Institute of Technology Kharagpur, Kharagpur, West Bengal, India {shabnam.samima,monalisa}@iitkgp.ac.in <sup>2</sup> Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur, West Bengal, India Debasis.samanta.iitkgp@gmail.com

**Abstract.** Vigilance or sustained attention is an extremely important aspect of monotonous and prolonged attention-seeking tasks. Recently, Event Related Potentials (ERPs) of the Electroencephalograph (EEG) have garnered great attention from researchers for their application to vigilance assessment. However, to date the studies relating ERPs to vigilance are in their nascent stage and require more rigorous research efforts. In this paper, we use the P200 and N200 ERPs of EEG to study vigilance. For this purpose, we performed Mackworth's clock test experiment with ten volunteers and measured their accuracy. From the measured accuracy and recorded EEG signals, we identify that the amplitude of the P200 and N200 ERPs is directly correlated with accuracy and thereby with the vigilance task. Thus, both P200 and N200 ERPs can be applied to detect vigilance (in real time) of people involved in continuous monitoring tasks.

**Keywords:** Vigilance detection *·* Attention monitoring *·* Human errors *·* Brain computing interface *·* Event related potential *·* EEG signals

### **1 Introduction**

According to Mackworth, "Vigilance is defined as a state of readiness to detect and respond to small changes occurring at random time intervals in the environment" [1]. In other words, vigilance or sustained attention is an act of careful observation of critical or rare events whose negligence may lead to catastrophe [2]. In today's world, where emphasis is laid on reducing risks and errors and mitigating the chances of accidents, it seems rational to assess operator vigilance in real time to avoid human errors. Air traffic control, drowsiness detection in drivers, inspection and quality control, automated navigation, military and border surveillance, life-guarding, cyber operations, space exploration, etc. [3], are some major domains where operators are involved in monotonous tasks for prolonged intervals of time and remaining vigilant is of utmost importance. However, in [4,5] it has been pointed out that sleep deprivation, work overload, stress, time pressure, drowsiness and prolonged working hours are the major factors that lead to low vigilance and, thereby, to human errors.

To date, several concerted efforts have been made in the literature to propose and design new techniques for vigilance detection using features such as heart rate variability [6], galvanic skin response [6], pupil diameter, eye-blink frequency [3] and brain activity measurement [7–9] (namely EEG (electroencephalography), MEG (magnetoencephalography), fNIRS (functional near-infrared spectroscopy), ECoG (electrocorticography), fMRI (functional magnetic resonance imaging), etc.). Although the techniques mentioned above are good contenders for vigilance detection, they have several serious limitations. For instance, eye-related features show strong inter-personal and intra-personal variability, EEG suffers from poor spatial resolution, MEG requires a special operating environment, ECoG involves invasive implantation of electrodes, fMRI carries a high equipment overhead, and fNIRS suffers from low spatial resolution.

Amongst the above-mentioned methods for vigilance detection, EEG is the most commonly studied physiological measure despite its poor spatial resolution. The prime reasons behind its popularity amongst researchers are: (1) its high temporal resolution, (2) its non-invasive nature and simplicity of operation, and (3) its relatively low cost compared with other devices. Furthermore, as vigilance deteriorates with time, it seems plausible to study brain signals in a time-bound fashion to assess vigilance status in real time. In this regard, the Event Related Potentials (ERPs) present in EEG signals have successfully been utilized to study the changes occurring in the human brain over time [10]. For instance, the P100-N200 ERP features have been utilized for studying emotional information processing in [11]; frontal midline theta and the N200 ERP have been shown to reflect complementary information about expectancy and outcome evaluation in [12]; the authors of [13] utilized the N200 ERP for word recognition; in [14], the N100, P200, N200 and P300 ERP components were used to study the impact of depression on attention. Further, ERPs have been used to understand reaction times in response to pictures of people depicting pain [15]; in [16], ERPs were utilized to understand the state of the brain in schizophrenia patients; in [17], the authors demonstrated the association of the mMMN, P200 and P500 ERP components with artificial grammar learning in the primate brain; in [18], the N400 and P200 components were utilized to investigate semantic and phonological processing in skilled and less-skilled comprehenders; besides, ERPs have also found utility in studying the multisensory integration (MSI) ability of the brain in school-aged children [19].

From the above literature, we observe that the P200 and N200 ERPs (see Fig. 1) have been instrumental in studying human cognitive behaviour and are promising for real-time assessment of vigilance. Concisely, the P200 ERP is a positive spike in the EEG signal generally observed within 150 to 250 ms after presentation of a target stimulus (auditory or visual event) [20], while the N200 is a negative potential usually evoked 180 to 325 ms after presentation of a specific visual or auditory stimulus following a string of standard (non-target) stimuli [21,22]. In general, P200 latency is a measure of stimulus classification speed, and its amplitude represents the amount of attentional resources devoted to the task along with the required degree of information processing; the N200 ERP, which is usually evoked only during conscious stimulus attention before the motor response, supports stimulus identification and discrimination, suggesting its link to cognitive processes.

**Fig. 1.** P200 and N200 components in ERP signal of EEG data

In this work we propose (a) to use the N200 and P200 ERPs for studying vigilance, (b) to observe the correlation of the N200 and P200 ERPs with the obtained behavioural accuracy, (c) to observe the variation in the amplitude of both ERPs in the presence of target and non-target stimuli, and (d) to observe the variation in the active areas of the brain before, during and after the experiment, checking whether hotspots are present in the areas from which the P200 and N200 evoke.

# **2 Proposed Methodology**

In the following, we present our proposed research methodology and steps for extracting ERPs (P200 and N200) from the EEG signals.

#### **2.1 Experimental Setup**

**Subjects:** Ten healthy, right-handed participants with normal or corrected-to-normal vision, aged between 26 and 33 years, volunteered for the experiment (see Table 1). To carefully monitor the vigilance of each volunteer, a proper schedule was maintained. It was ensured that the participants: (a) were not sleep deprived, (b) were under no medication and (c) had no history of mental illness. We also took written consent from each participant, under a protocol approved by the institution's ethics committee, before conducting the experiment. Further, we asked each volunteer not to consume tea or coffee for 3 to 4 h prior to the experiment. Keeping in mind the usual circadian cycle of activeness of each participant, the experiment was conducted in the morning, between 7 am and 10 am.


**Table 1.** Participant details

**Vigilance Task:** To study the variation of vigilance over a long period of time, we utilized the computerized version of the *Mackworth Clock Test* as the experimentation tool, wherein a small circular pointer moves in a circle like the seconds hand of an analog clock, changing its position approximately every second. However, at infrequent and irregular intervals, the pointer can make a double jump. The task of each participant is to detect and respond to the double jump of the pointer, indicating the presence of the target event, by pressing the *space bar* key of the keyboard.

### **2.2 Protocol**

The participants were comfortably seated in a quiet, isolated room (devoid of any Wi-Fi connections) in which a constant temperature was maintained. Before the actual experiment, each participant was given proper demonstrations and instructions about the experiment and was asked to relax for ten minutes. Further, a five-minute practice session was arranged for each participant to become accustomed to the task. We used a large 20 in. monitor, kept at a distance of 65 cm from the user, to present the visual stimuli. The experiment began with an EEG recording of a five-minute idle session, followed by the clock test of 20 min, comprising a total of 1200 trials. After completion of the clock test, we again recorded the EEG signals for a five-minute idle session. Besides, to keep track of each participant's responses and to ensure true marking of the target events, we also recorded the hardware interrupt from the keyboard. The entire experimental procedure is shown pictorially in Fig. 2.

**Fig. 2.** The overall experimental procedure

### **2.3 Data Acquisition**

The experiment was designed to be completed in 30 min. All EEG recordings were carried out with the portable, user-friendly and cost-effective Emotiv Epoc+ device, which follows the well-known 10–20 international system. This device comprises 14 electrodes positioned at the AF3, F7, F3, FC5, T7, P7, O1, O2, P8, T8, FC6, F4, F8 and AF4 locations and has a sampling rate of 128 Hz. In total, we collected 12000 trials from the ten voluntary participants.

### **2.4 Detection of ERPs**

1. *Pre-processing*: While recording EEG data, various external environmental disturbances contaminate the data with artifacts, under which the extraction of desired/useful features from the EEG signal becomes very difficult. Hence, to minimize the effect of artifacts, it is mandatory to pre-process the recorded raw EEG signals. For this purpose, filters in the standard frequency range of 0.1–30 Hz are used: they extract the desired brain activity by rejecting undesired signal components below 0.1 Hz and above 30 Hz. In the present work, we used a Chebyshev high-pass filter (cut-off frequency 0.1 Hz) to remove disturbing components arising from breathing and from voltage changes due to neuronal and non-neuronal artifacts, and a Chebyshev low-pass filter (cut-off frequency 30 Hz) to eliminate the noise arising from muscle movements. Further, to ensure rejection of the strong 50 Hz power supply interference, impedance fluctuations, cable defects, electrical noise, and unbalanced electrode impedances, we utilized (at recording time) a notch filter with a null frequency of 50 Hz.

2. *Feature Extraction*: It is known from the literature that the P200 and N200 ERPs are dominant over the parietal, occipital and frontal regions of the brain. Thus, to locate these features, we used the AF3, AF4, F3, F4, P7, P8, O1 and O2 electrodes. To extract features from the EEG signals, the pre-processed data is first marked to identify the type of each event (that is, correctly identified event, falsely identified event or missed event). Next, baseline removal is carried out on this marked data, followed by epoch averaging (500 ms pre-stimulus to 1000 ms post-stimulus) to generate the ERP waveforms. Finally, to verify the presence of the P200 and N200 ERPs, we performed ensemble averaging of the target-event epochs and plotted the average waveform.
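A minimal sketch of the epoching, baseline-removal and ensemble-averaging steps, assuming events are given as sample indices into a single-channel recording (the function and variable names are ours, not the paper's):

```python
import numpy as np

FS = 128                                  # sampling rate (Hz)
PRE = int(0.5 * FS)                       # 500 ms pre-stimulus  -> 64 samples
POST = int(1.0 * FS)                      # 1000 ms post-stimulus -> 128 samples

def erp_average(eeg, event_samples):
    """Cut an epoch around each event, subtract the pre-stimulus baseline,
    and return the ensemble average (the ERP waveform)."""
    epochs = []
    for s in event_samples:
        if s - PRE < 0 or s + POST > len(eeg):
            continue                      # skip events too close to the edges
        epoch = eeg[s - PRE : s + POST].astype(float)
        epoch -= epoch[:PRE].mean()       # baseline removal
        epochs.append(epoch)
    return np.mean(epochs, axis=0)
```

In the paper, this averaging is applied separately per event type (correct, false, missed) so that the resulting waveforms can be compared across conditions.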

### **3 Results and Discussions**

To carefully observe the pattern of vigilance changes, the 20 min of recorded EEG data was divided into ten equal observation periods of two minutes each. We then observed the amplitude and latency variation of the P200 and N200 ERP components for those instances where the user responded correctly to an occurrence of the target event. Further, to establish a correlation between them, we compared the accuracy attained by each individual in correctly detecting the target, while focusing on the pointer of the clock test, with the amplitudes of the P200 and N200 ERPs. The variation of the amplitude and latency of P200 and N200 is reported in Tables 2 and 3, respectively. The amplitude ranges for the P200 and N200 ERPs are heuristically defined as follows:

$$P200\text{ (amplitude)} = \begin{cases} \text{very low}, & \text{for } value \geqslant 0.1 \text{ }\mu\text{V } and < 1 \text{ }\mu\text{V} \\ \text{low}, & \text{for } value \geqslant 1 \text{ }\mu\text{V } and < 3 \text{ }\mu\text{V} \\ \text{moderate}, & \text{for } value \geqslant 3 \text{ }\mu\text{V } and < 7 \text{ }\mu\text{V} \\ \text{high}, & \text{for } value \geqslant 7 \text{ }\mu\text{V} \end{cases} \tag{1}$$

$$N200\text{ (amplitude)} = \begin{cases} \text{very low}, & \text{for } value \leqslant -0.01 \text{ }\mu\text{V } and > -1 \text{ }\mu\text{V} \\ \text{low}, & \text{for } value \leqslant -1 \text{ }\mu\text{V } and > -3 \text{ }\mu\text{V} \\ \text{moderate}, & \text{for } value \leqslant -3 \text{ }\mu\text{V } and > -6 \text{ }\mu\text{V} \\ \text{high}, & \text{for } value \leqslant -6 \text{ }\mu\text{V} \end{cases} \tag{2}$$
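Equations 1 and 2 can be encoded directly. The function names below are ours, and the N200 thresholds are interpreted by magnitude (more negative means larger), since the N200 is a negative-going deflection:

```python
def classify_p200(uv):
    """Heuristic P200 amplitude class per Eq. 1 (uv in microvolts)."""
    if uv >= 7:
        return "high"
    if uv >= 3:
        return "moderate"
    if uv >= 1:
        return "low"
    if uv >= 0.1:
        return "very low"
    return "below range"

def classify_n200(uv):
    """Heuristic N200 amplitude class per Eq. 2 (uv in microvolts, negative)."""
    if uv <= -6:
        return "high"
    if uv <= -3:
        return "moderate"
    if uv <= -1:
        return "low"
    if uv <= -0.01:
        return "very low"
    return "below range"
```

For instance, the 12.67 µV P200 value reported for participant P2 falls in the "high" class, and the −1.25 µV N200 value for P6 falls in the "low" class.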

To evaluate the performance of the participants in terms of detection accuracy, we sub-divided the recorded EEG data into four categories: *true alarm* (*TA*), *true skip* (*TS*), *false alarm* (*FA*) and *false skip* (*FS*). In terms of the Mackworth clock test experiment, a *true alarm* represents correct identification of a target stimulus, a *true skip* represents correct identification of a non-target stimulus,


**Table 2.** Variation of amplitude and latency of P200 ERP

**Table 3.** Variation of amplitude and latency of N200 ERP


a *false alarm* represents a key press at a non-target and a *false skip* represents non-identification of a target stimulus. Based on these data, the accuracy is calculated using Eq. 3. The detection accuracy of each individual who participated in the experiment is tabulated in Table 4. The latencies of the P200 and N200 ERPs were observed to lie within the previously reported ranges; however, no particular trend relating the obtained latencies to the amplitudes was observed.

$$Accuracy = \frac{TA + TS}{TA + TS + FA + FS} \tag{3}$$

The obtained accuracy (in %) is divided into four classes, defined as follows:

$$Accuracy\ (\%) = \begin{cases} \text{very low}, & \text{for } value \geqslant 0\% \text{ } and < 30\% \\ \text{low}, & \text{for } value \geqslant 30\% \text{ } and < 50\% \\ \text{moderate}, & \text{for } value \geqslant 50\% \text{ } and < 80\% \\ \text{high}, & \text{for } value \geqslant 80\% \text{ } and \leqslant 100\% \end{cases} \tag{4}$$
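Equations 3 and 4 together give a small, self-contained computation (a minimal sketch; the function names are ours):

```python
def accuracy(ta, ts, fa, fs):
    """Detection accuracy per Eq. 3, as a percentage:
    (TA + TS) / (TA + TS + FA + FS) * 100."""
    return 100.0 * (ta + ts) / (ta + ts + fa + fs)

def accuracy_class(pct):
    """Accuracy class per Eq. 4."""
    if pct >= 80:
        return "high"
    if pct >= 50:
        return "moderate"
    if pct >= 30:
        return "low"
    return "very low"
```

For example, a two-minute window with 43 true alarms, 50 true skips, 3 false alarms and 4 false skips yields an accuracy of 93%, which falls in the "high" class.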


**Table 4.** Variation of accuracy (in %) of each participant

From Tables 2, 3 and 4, and using Eqs. 1, 2 and 4, we can see that for participant P2, during the observation interval of (4–6) min, the amplitude of P200 is high (12.67 µV) while the amplitude of N200 is low (−3.04 µV), resulting in 93.48% accuracy. For participant P6, during the observation interval of (16–18) min, the amplitudes of both P200 and N200 are low, at 2.31 µV and −1.25 µV respectively, resulting in 71.41% accuracy. For participant P1, during the observation interval of (6–8) min, the amplitude of P200 is low (1.71 µV) while the amplitude of N200 is high (−7.19 µV), resulting in 81.48% accuracy. Similarly, for participant P1, during the observation interval of (0–2) min, the amplitudes of both P200 and N200 are high, at 10.48 µV and −6.109 µV respectively, resulting in 100% accuracy. Other values may be verified from the tables in a similar manner to conclude that accuracy and the two ERPs (P200 and N200) are correlated: whenever the detection accuracy is high, the amplitudes of P200 and N200 are high. In other words, whenever an individual successfully distinguishes the target stimulus amongst all other presented stimuli, the two ERPs, *viz.* N200 and P200, are elicited with high amplitude. We also show the variation of accuracy and of the P200 and N200 amplitudes with time for participant P1 in Fig. 3, from which it can be easily observed that the accuracy of target detection and the amplitudes of both ERPs are correlated.
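The claimed correlation can be checked numerically, e.g. with Pearson's coefficient over per-window values. The arrays below are invented illustrative numbers in the same rough ranges as Tables 2–4, not the paper's data:

```python
import numpy as np

# Hypothetical per-window values for one participant (illustrative only):
acc  = np.array([100.0, 93.5, 81.5, 71.4, 60.0])   # accuracy (%)
p200 = np.array([10.5, 9.0, 6.8, 2.3, 1.1])        # P200 amplitude (µV)
n200 = np.array([-6.1, -5.2, -4.0, -1.3, -0.8])    # N200 amplitude (µV)

# Pearson correlation of accuracy with P200 amplitude, and with
# N200 magnitude (N200 is negative-going, so its absolute value is used).
r_p = np.corrcoef(acc, p200)[0, 1]
r_n = np.corrcoef(acc, np.abs(n200))[0, 1]
```

With values following the trend reported in the tables, both coefficients come out strongly positive, consistent with the correlation the paper describes qualitatively.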

**Fig. 3.** Plot showing variation of accuracy and amplitude of P200 and N200 with time

To study the variations in the ERPs under true alarm (when the user correctly identifies a target event) and false alarm (when the user incorrectly responds to a non-target event) conditions, we plotted the P200 and N200 ERPs on a single graph with a common origin (see Fig. 4). We observed a considerable difference in the amplitude of both ERPs between the true and false alarm conditions.

Further, Fig. 5 shows the variation in P200 and N200 ERPs under the presence of target and non-target stimuli.

Figure 6 depicts the variation in the active regions of the brain before, during and after the experiment. The blue spots visible in the pre- and post-experiment scalp images indicate low brain activity, while the red spots visible during the experiment indicate increased activity in the associated regions. From instance 1, we observed that during the experiment the parietal, frontal and some parts of the occipital region were highly energized, and these regions showed the presence of the P200 and N200 ERPs. Further, from instance 2, we observed that during the experiment the frontal region was highly energized and showed the presence of the N200 ERP. Through this, we verified that our experiment successfully evokes the two ERPs from the designated regions. Hence, the selected ERPs can be applied for vigilance detection.

**Fig. 4.** P200 and N200 peaks during true alarm and false alarm conditions

**Fig. 5.** Variation of ERPs in target and non-target conditions

**Fig. 6.** Variation of scalp plot before, during and after experiment (Color figure online)

# **4 Conclusion**

In the literature, different features of EEG signals have been utilized to study the vigilance level of human beings. In this work, first, we successfully demonstrated that both the P200 and N200 ERPs are suitable candidates for studying vigilance. Second, we observed the variation of the P200 and N200 amplitudes between true alarm and false alarm conditions. Third, we observed the variation of the P200 and N200 amplitudes under target and non-target stimuli. Fourth, with the help of the scalp plots of Fig. 6, we verified the hot-spots/active regions of the brain from which the studied ERPs originate.

This work may be applied to real-time analysis of vigilance. In the future, we plan to extend this work to quantify the level of vigilance instead of merely indicating its presence or absence.

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

### Author Index

Abdelrahman, Yomna 86
Abouelsadat, Wael 86
Antona, Margherita 137
Barnwal, Santosh Kumar 122
Bazzano, Federica 32, 60
Bellik, Yacine 8
Bhattacharya, Paritosh 109
Böck, Ronald 189
Casola, Silvia 60
Ceròn, Gabriel 60
Chander, Ajay 165
Clavel, Celine 8
Dam, Cathrine L. 165
Deb, Suman 109
Egorow, Olga 189
Eisa, Rana Mohamed 86
El-Shanwany, Yassin 86
Gaspardone, Marco 32
Gepperth, Alexander 19
Grimaldi, Angelo 32
Habiyaremye, Jean Luc 150
Handmann, Uwe 19
Kopinski, Thomas 19
Korozi, Maria 137
Lamberti, Fabrizio 32, 60
Leonidis, Asterios 137
Londoño, Jaime 60
Metin, Senem Kumova 178
Mitra, Pabitra 47
Montuschi, Paolo 60
Oulasvirta, Antti 3
Pal, Anindya 109
Paravati, Gianluca 32, 60
Rabha, Joytirmoy 47
Rajan, Rachel 73
Samanta, Debasis 47, 202
Samima, Shabnam 202
Sarkar, Ayanava 19
Sarma, Monalisa 47, 202
Siegert, Ingo 189
Sreeja, S. R. 47
Srinivasan, Ramya 165
Stephanidis, Constantine 137
Tanese, Flavio 60
Thekkan Devassy, Sunny 73
Tiwary, Uma Shanker 122
Tran, Cong Chi 150
Wei, Yingying 150
Wendemuth, Andreas 189
Yan, Shengyuan 150
